(57) Abstract

A method and apparatus for the automatic analysis, synthesis and modification of audio signals, based on an overlap-add sinusoidal 
model is disclosed. Automatic analysis of amplitude, frequency and phase parameters of the model is achieved using an analysis-by- 
synthesis procedure (108) which incorporates successive approximation, yielding synthetic waveforms which are very good approximations 
to the original waveforms. In addition, a new approach to pitch-scale modification (111) allows for the use of arbitrary spectral envelope
estimates and addresses the problems of high-frequency loss and noise amplification encountered with prior art methods. 
AUDIO ANALYSIS/SYNTHESIS SYSTEM



This application is a continuation-in-part of United States Serial No. 07/748,544 filed 
August 22, 1991, entitled "AUDIO ANALYSIS/SYNTHESIS SYSTEM." 

TECHNICAL FIELD 



The present invention relates to methods and apparatus for acoustic signal processing
and especially for audio analysis and synthesis. More particularly, the present
invention relates to the analysis and synthesis of audio signals such as speech or music,
whereby time-, frequency- and pitch-scale modifications may be introduced without
perceptible distortion.



BACKGROUND OF THE INVENTION 


For many years the most popular approach to representing speech signals parametrically
has been linear predictive (LP) modeling. Linear prediction is described by
J. Makhoul, "Linear Prediction: A Tutorial Review," Proc. IEEE, vol. 63, pp. 561-580,
April 1975. In this approach, the speech production process is modeled as a
linear time-varying, all-pole vocal tract filter driven by an excitation signal
representing characteristics of the glottal waveform. While many variations on this basic
model have been widely used in low bit-rate speech coding, the formulation known
as pitch-excited LPC has been very popular for speech synthesis and modification as
well. In pitch-excited LPC, the excitation signal is modeled either as a periodic pulse
train for voiced speech or as white noise for unvoiced speech. By effectively separating
and parameterizing the voicing state, pitch frequency and articulation rate of speech,
pitch-excited LPC can flexibly modify analyzed speech as well as produce artificial
speech given linguistic production rules (referred to as synthesis-by-rule).

However, pitch-excited LPC is inherently constrained and suffers from well-known
distortion characteristics. LP modeling is based on the assumption that the vocal
tract may be modeled as an all-pole filter; deviations of an actual vocal tract from
this ideal thus result in an excitation signal without the purely pulse-like or noisy
structure assumed in the excitation model. Pitch-excited LPC therefore produces
synthetic speech with noticeable and objectionable distortions. Also, LP modeling
assumes a priori that a given signal is the output of a time-varying filter driven by an
easily represented excitation signal, which limits its usefulness to those signals (such
as speech) which are reasonably well represented by this structure. Furthermore,
pitch-excited LPC typically requires a "voiced/unvoiced" classification and a pitch
estimate for voiced speech; serious distortions result from errors in either procedure.
Time-frequency representations of speech combine the observations that much
speech information resides in the frequency domain and that speech production is
an inherently non-stationary process. While many different types of time-frequency
representations exist, to date the most popular for the purpose of speech processing
has been the short-time Fourier transform (STFT). One formulation of the STFT,
discussed in the article by J. L. Flanagan and R. M. Golden, "Phase Vocoder," Bell
Sys. Tech. J., vol. 45, pp. 1493-1509, 1966, and known as the digital phase vocoder
(DPV), parameterizes speech production information in a manner very similar to LP
modeling and is capable of performing speech modifications without the constraints
of pitch-excited LPC.

Unfortunately, the DPV is also computationally intensive, limiting its usefulness
in real-time applications. An alternate approach to the problem of speech modification
using the STFT is based on the discrete short-time Fourier transform (DSTFT),
implemented using a Fast Fourier Transform (FFT) algorithm. This approach is
described in the Ph.D. thesis of M. R. Portnoff, Time-Scale Modification of Speech
Based on Short-Time Fourier Analysis, Massachusetts Institute of Technology, 1978.
While this approach is computationally efficient and provides much of the functionality
of the DPV, when applied to modifications the DSTFT generates reverberant
artifacts due to phase distortion. An iterative approach to phase estimation in the
modified transform has been disclosed by D. W. Griffin and J. S. Lim in "Signal
Estimation from Modified Short-Time Fourier Transform," IEEE Trans. on Acoust.,
Speech and Signal Processing, vol. ASSP-32, no. 2, pp. 236-242, 1984. This estimation
technique reduces phase distortion, but adds greatly to the computation required
for implementation.

Sinusoidal modeling, which represents signals as sums of arbitrary amplitude- and
frequency-modulated sinusoids, has recently been introduced as a high-quality alternative
to LP modeling and the STFT and offers advantages over these approaches
for synthesis and modification problems. As with the STFT, sinusoidal modeling
operates without an "all-pole" constraint, resulting in more natural sounding synthetic
and modified speech. Also, sinusoidal modeling does not require the restrictive
"source/filter" structure of LP modeling; sinusoidal models are thus capable of
representing signals from a variety of sources, including speech from multiple speakers,
music signals, speech in musical backgrounds, and certain biological and biomedical
signals. In addition, sinusoidal models offer greater access to and control over speech
production parameters than the STFT.

The most notable and widely used formulation of sinusoidal modeling is the Sine-Wave
System introduced by McAulay and Quatieri, as described in their articles
"Speech Analysis/Synthesis Based on a Sinusoidal Representation," IEEE Trans. on
Acoust., Speech and Signal Processing, vol. ASSP-34, pp. 744-754, August 1986, and
"Speech Transformations Based on a Sinusoidal Representation," IEEE Trans. on
Acoust., Speech and Signal Processing, vol. ASSP-34, pp. 1449-1464, December 1986.
The Sine-Wave System has proven to be useful in a wide range of speech processing
applications, and the analysis and synthesis techniques used in the system are
well-justified and reasonable, given certain assumptions.

Analysis in the Sine-Wave System derives model parameters from peaks of the
spectrum of a windowed signal segment. The theoretical justification for this analysis
technique is based on an analogy to least-squares approximation of the segment by
constant-amplitude, constant-frequency sinusoids. However, sinusoids of this form
are not used to represent the analyzed signal; instead, synthesis is implemented with
parameter tracks created by matching sinusoids from one frame to the next and
interpolating the matched parameters using polynomial functions.

This implementation, while making possible many of the applications of the system,
represents an uncontrolled departure from the theoretical basis of the analysis
technique. This can lead to distortions, particularly during non-stationary portions
of a signal. Furthermore, the matching and interpolation algorithms add to the
computational overhead of the system, and the continuously variable nature of the
parameter tracks necessitates direct evaluation of the sinusoidal components at each
sample point, a significant computational obstacle. A more computationally efficient
synthesis algorithm for the Sine-Wave System has been proposed by McAulay and
Quatieri in "Computationally Efficient Sine-Wave Synthesis and its Application to
Sinusoidal Transform Coding," Proc. IEEE Int'l Conf. on Acoust., Speech and Signal
Processing, pp. 370-373, April 1988, but this algorithm departs even farther from the
theoretical basis of analysis.

Many techniques for the digital generation of musical sounds have been studied,
and many are used in commercially available music synthesizers. In all of these techniques
a basic tradeoff is encountered; namely, the conflict between accuracy and
generality (defined as the ability to model a wide variety of sounds) on the one hand
and computational efficiency on the other. Some techniques, such as frequency modulation
(FM) synthesis as described by J. M. Chowning, "The Synthesis of Complex
Audio Spectra by Means of Frequency Modulation," J. Audio Eng. Soc., vol. 21,
pp. 526-534, September 1973, are computationally efficient and can produce a wide
variety of new sounds, but lack the ability to accurately model the sounds of existing
musical instruments.

On the other hand, sinusoidal additive synthesis implemented using the DPV is
capable of analyzing the sound of a given instrument, synthesizing a perfect replica
and performing a wide variety of modifications. However, as previously mentioned,
the amount of computation needed to calculate the large number of time-varying
sinusoidal components required prohibits real-time synthesis using relatively inexpensive
hardware. As in the case of time-frequency speech modeling, the computational
problems of additive synthesis of musical tones may be addressed by formulating
the DPV in terms of the DSTFT and implementing this formulation using FFT
algorithms. Unfortunately, this strategy produces the same type of distortion when
applied to musical tone synthesis as to speech synthesis.

There clearly exists a need for better methods and devices for the analysis, synthesis
and modification of audio waveforms. In particular, an analysis/synthesis system
capable of altering the pitch frequency and articulation rate of speech and music signals
and capable of operating with low computational requirements and therefore low
hardware cost would satisfy long-felt needs and would contribute significantly to the
art.



SUMMARY OF THE INVENTION



The present invention addresses the above described limitations of the prior art
and achieves a technical advance by provision of a method and structural embodiment
comprising: an analyzer responsive to either speech or musical tone signals which
for each of a plurality of overlapping data frames extracts and stores parameters
which serve to represent input signals in terms of an overlap-add, quasi-harmonic
sinusoidal model; and a synthesizer responsive to the stored parameter set previously
determined by analysis to produce a synthetic facsimile of the analyzed signal or
alternately a synthetic audio signal advantageously modified in time-, frequency- or
pitch-scale.

In one embodiment of the present invention appropriate for speech signals, the
analyzer determines a time-varying gain signal representative of time-varying energy
changes in the input signal. This time-varying gain is incorporated in the synthesis
model and acts to improve modeling accuracy during transient portions of a signal.
Also, given isolated frames of input signal and time-varying gain signal data the
analyzer determines sinusoidal model parameters using a frequency-domain
analysis-by-synthesis procedure implemented using a Fast Fourier Transform (FFT) algorithm.
Advantageously, this analysis procedure overcomes inaccuracies encountered with
discrete Fourier transform "peak-picking" analysis as used in the Sine-Wave System,
while maintaining a comparable computational load. Furthermore, a novel fundamental
frequency estimation algorithm is employed which uses knowledge gained from
analysis to improve computational efficiency over prior art methods.

The synthesizer associated with this embodiment advantageously uses a refined
modification model, which allows modified synthetic speech to be produced without
the objectionable artifacts typically associated with modification using the DSTFT
and other prior art methods. In addition, overlap-add synthesis may be implemented
using an FFT algorithm, providing improved computational efficiency over prior art
methods without departing significantly from the synthesis model used in analysis.

The synthesizer also incorporates an improved phase coherence preservation algorithm
which provides higher quality modified speech. Furthermore, the synthesizer
performs pitch-scale modification using a phasor interpolation procedure. This procedure
eliminates the problems of information loss and noise migration often encountered
in prior art methods of pitch modification.

In an embodiment of the present invention appropriate for musical tone signals, a
harmonically-constrained analysis-by-synthesis procedure is used to determine appropriate
sinusoidal model parameters and a fundamental frequency estimate for each
frame of signal data. This procedure allows for fine pitch tracking over the analyzed
signal without significantly adding to the computational load of analysis. Due to a
priori knowledge of pitch, the synthesizer associated with this embodiment uses a
simple functional constraint to maintain phase coherence, significantly reducing the
amount of computation required to perform modifications.
BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a system level block diagram of a speech analyzer according to the present
invention showing the required signal processing elements and their relationship to
the flow of the information signals.

Fig. 2 is a flowchart illustrating the information processing task which takes place
in the time-varying gain calculator block of Fig. 1.

Fig. 3 is an illustration of overlap-add synthesis, showing the relationship of windowed
synthetic contributions and their addition to form a synthesis frame of $\hat{s}[n]$.

Fig. 4 is a functional block diagram illustrating the closed-loop analysis-by-synthesis
procedure used in the invention.

Figs. 5 and 6 are flowcharts showing the information processing tasks achieved by
the analysis-by-synthesis block of Fig. 1.

Figs. 7-9 are flowcharts showing the information processing tasks achieved by the
fundamental frequency estimator block of Fig. 1.

Fig. 10 is a flowchart showing the information processing tasks achieved by the
harmonic assignment block of Fig. 1.

Fig. 11 is a system level block diagram of a speech analyzer according to the
present invention similar in operation to the speech analyzer of Fig. 1 but which
operates without incorporating time-varying gain sequence $\sigma[n]$.

Fig. 12 is a system level block diagram of a musical tone analyzer according
to the present invention showing the required signal processing elements and their
relationship to the flow of the information signals.

Figs. 13-15 are flowcharts showing the information processing tasks achieved by
the harmonically-constrained analysis-by-synthesis block of Fig. 12.

Fig. 16 is a system level block diagram of a musical tone analyzer according to
the present invention similar in operation to the musical tone analyzer of Fig. 12 but
which operates without incorporating time-varying gain sequence $\sigma[n]$.

Fig. 17 is a system level block diagram of a speech synthesizer according to the
present invention, showing the required signal processing elements and their relationship
to the flow of the information signals.

Fig. 18 is an illustration of distortion due to extrapolation beyond analysis frame
boundaries. The phase coherence of $\hat{s}^k[n]$ is seen to break down quickly outside the
analysis frame due to the quasi-harmonic nature of the model.

Fig. 19 is an illustration of the effect of differential frequency scaling in the refined
modification model. The phase coherence of the synthetic contribution breaks down
more slowly due to "pulling in" the differential frequencies.

Figs. 20 and 21 are flowcharts showing the information processing tasks
achieved by the pitch onset time estimator block of Fig. 17.

Fig. 22 is an illustration of virtual excitation sequences in both the unmodified
and modified cases, and of the coherence constraint imposed on the sequences at
boundary C".

Figs. 23 and 24 are flowcharts showing the information processing tasks
achieved by the speech synthesizer DFT assignment block of Fig. 17.

Fig. 25 is a system level block diagram of a speech synthesizer according to the
present invention similar in operation to the speech synthesizer of Fig. 17 but which
is capable of performing time- and pitch-scale modifications.

Figs. 26 and 27 are flowcharts showing the information processing tasks
achieved by the phasor interpolator block of Fig. 25.

Fig. 28 is a system level block diagram of a musical tone synthesizer according
to the present invention showing the required signal processing elements and their
relationship to the flow of the information signals.

Fig. 29 is a system level block diagram of a musical tone synthesizer according to
the present invention similar in operation to the musical tone synthesizer of Fig. 28
but which is capable of performing time- and pitch-scale modifications.

Fig. 30 is a system level block diagram showing the architecture of a microprocessor
implementation of the audio synthesis system of the present invention.
DETAILED DESCRIPTION

Figure 1 illustrates an analyzer embodiment of the present invention appropriate
for the analysis of speech signals. Speech analyzer 100 of Figure 1 responds to an
analog speech signal, denoted by $s_c(t)$ and received via path 120, in order to determine
the parameters of a signal model representing the input speech and to encode and
store these parameters in storage element 113 via path 129. Speech analyzer 100
digitizes and quantizes $s_c(t)$ using analog-to-digital (A/D) converter 101, according
to the relation

$$s[n] = Q\{s_c(n/F_s)\}, \tag{1}$$

where $F_s$ is the sampling frequency in samples/sec and $Q\{\cdot\}$ represents the
quantization operator of A/D converter 101. It is assumed that $s_c(t)$ is bandlimited to $F_s/2$
Hz.

Time-varying gain calculator 102 responds to the data stream produced by A/D
converter 101 to produce a sequence $\sigma[n]$ which reflects time-varying changes in the
average magnitude of $s[n]$. This sequence may be determined by applying a lowpass
digital filter to $|s[n]|$. One such filter is defined by the recursive relation

$$y_i[n] = \lambda\, y_i[n-1] + (1-\lambda)\, y_{i-1}[n], \quad 1 \le i \le I, \tag{2}$$

where $y_0[n] = |s[n]|$. The time-varying gain sequence is then given by

$$\sigma[n] = y_I[n + n_\sigma], \tag{3}$$

where $n_\sigma$ is the delay in samples introduced by filtering. The frequency response of
this filter is given by



$$F(e^{j\omega}) = \left(\frac{1-\lambda}{1-\lambda e^{-j\omega}}\right)^{I}, \tag{4}$$

where the filter parameters $\lambda$ and $I$ determine the frequency selectivity and rolloff of
the filter, respectively. For speech analysis, a fixed value of $I = 20$ is appropriate,
while $\lambda$ is varied as a function of $F_s$ according to

$$\lambda = .9^{F_s/8000}, \tag{5}$$

assuring that the filter bandwidth is approximately independent of the sampling
frequency. The filter delay $n_\sigma$ can then be determined as

$$n_\sigma = \left\langle \frac{I\lambda}{1-\lambda} \right\rangle, \tag{6}$$
where $\langle\cdot\rangle$ represents the "round to nearest integer" operator. A flowchart of this
algorithm is shown in Figure 2. Time-varying gain calculator 102 transmits $\sigma[n]$ via
path 121 to parameter encoder 112 for subsequent transmission to storage element
113.
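
By way of illustration only, the recursion of Equation 2 and the delay compensation of Equation 3 might be sketched in Python as follows. The function name `time_varying_gain`, its default arguments, and the end-of-signal handling (holding the last filtered value) are assumptions made for this sketch; the delay is taken as the rounded zero-frequency group delay of the cascade, which is itself an assumed reading of Equation 6.

```python
def time_varying_gain(s, lam=0.9, stages=20):
    """Sketch of Equations 2 and 3: a cascade of one-pole lowpass filters on |s[n]|.

    `lam` plays the role of lambda and `stages` the role of I (both assumed defaults).
    Returns the gain sequence and the delay used to align it with the input.
    """
    y = [abs(v) for v in s]                  # y_0[n] = |s[n]|
    for _ in range(stages):                  # y_i[n] = lam*y_i[n-1] + (1-lam)*y_{i-1}[n]
        prev = 0.0
        out = []
        for v in y:
            prev = lam * prev + (1.0 - lam) * v
            out.append(prev)
        y = out
    # Assumed delay: zero-frequency group delay of the cascade, rounded to an integer.
    delay = int(round(stages * lam / (1.0 - lam)))
    # sigma[n] = y_I[n + delay]; hold the last value where the shift runs past the end.
    sigma = [y[min(n + delay, len(y) - 1)] for n in range(len(s))]
    return sigma, delay
```

For a constant-magnitude input the returned sequence settles to that magnitude once the filter transient has passed, which is the behavior expected of an envelope estimate.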

It should be noted that any components of $s[n]$ with frequencies close to $F_s/2$ will
be "aliased" into low-frequency components by the absolute value operator $|\cdot|$, which
can cause distortion in $\sigma[n]$. Therefore, it is advisable to apply a lowpass filter to any
$s[n]$ known to contain significant high-frequency energy before taking the absolute
value. Such a filter need only attenuate frequencies near $F_s/2$, thus it need not be
complicated. One example is the simple filter defined by

$$s'[n] = .25s[n-1] + .5s[n] + .25s[n+1]. \tag{7}$$

Consider now the operation of speech analyzer 100 in greater detail. The signal
model used in the invention to represent $s[n]$ is an overlap-add sinusoidal model
formulation which produces an approximation to $s[n]$ given by

$$\hat{s}[n] = \sigma[n] \sum_{k=-\infty}^{\infty} w_s[n - kN_s]\, \hat{s}^k[n - kN_s], \tag{8}$$

where $\sigma[n]$ controls the time-varying intensity of $\hat{s}[n]$, $w_s[n]$ is a complementary
synthesis window which obeys the constraint

$$\sum_{k=-\infty}^{\infty} w_s[n - kN_s] = 1, \tag{9}$$

and $\hat{s}^k[n]$, the $k$-th synthetic contribution, is given by

$$\hat{s}^k[n] = \sum_{j=1}^{J[k]} A_j^k \cos(\omega_j^k n + \phi_j^k), \tag{10}$$

where $\omega_j^k = 2\pi f_j^k/F_s$ and where $0 \le f_j^k \le F_s/2$. The "synthesis frame length" $N_s$
typically corresponds to between 5 and 20 msec, depending on application requirements.
While an arbitrary complementary window function may be used for $w_s[n]$, a
symmetric, tapered window such as a Hanning window of the form

$$w_s[n] = \begin{cases} \cos^2(n\pi/2N_s), & |n| \le N_s \\ 0, & \text{otherwise} \end{cases} \tag{11}$$

is typically used. With this window, a synthesis frame of $N_s$ samples of $\hat{s}[n]$ may be
written as

$$\hat{s}[n + kN_s] = \sigma[n + kN_s]\left(w_s[n]\,\hat{s}^k[n] + w_s[n - N_s]\,\hat{s}^{k+1}[n - N_s]\right), \tag{12}$$

for $0 \le n < N_s$. Figure 3 illustrates a synthesis frame and the overlapping synthetic
sequences which produce it.
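
The overlap-add relation of Equation 12 may be illustrated with the following Python sketch. It assumes a squared-cosine (Hanning-type) window, which satisfies the complementarity constraint of Equation 9 for a shift of $N_s$ samples; the function names and the representation of each contribution as a list of (amplitude, frequency, phase) tuples are choices made for this sketch only.

```python
import math

def synthesis_window(n, ns):
    """Assumed Hanning-type window: cos^2(pi*n / (2*ns)) for |n| <= ns, zero elsewhere."""
    return math.cos(math.pi * n / (2.0 * ns)) ** 2 if abs(n) <= ns else 0.0

def contribution(n, partials):
    """Equation 10: sum of constant-amplitude, constant-frequency sinusoids.

    `partials` is a list of (amplitude, omega, phase) tuples for one frame.
    """
    return sum(a * math.cos(w * n + p) for a, w, p in partials)

def synthesis_frame(partials_k, partials_k1, gain, ns):
    """Equation 12: one frame of ns output samples from contributions k and k+1.

    `gain[n]` holds the time-varying gain sigma[n + k*ns] for 0 <= n < ns.
    """
    return [
        gain[n] * (synthesis_window(n, ns) * contribution(n, partials_k)
                   + synthesis_window(n - ns, ns) * contribution(n - ns, partials_k1))
        for n in range(ns)
    ]
```

When the two contributions describe the same stationary sinusoid (with the phase of frame k+1 advanced by $\omega N_s$), the two windowed terms sum back to that sinusoid exactly, since the window values at $n$ and $n - N_s$ add to one.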

Given $\sigma[n]$, the objective of analysis is to determine amplitudes $\{A_j^k\}$, frequencies
$\{\omega_j^k\}$ and phases $\{\phi_j^k\}$ for each $\hat{s}^k[n]$ in Equation 8 such that $\hat{s}[n]$ is a "closest
approximation" to $s[n]$ in some sense. An approach typically employed to solve problems of
this type is to minimize the mean-square error

$$E = \sum_{n=-\infty}^{\infty} \{s[n] - \hat{s}[n]\}^2 \tag{13}$$

in terms of the parameters of $\hat{s}[n]$. However, attempting to solve this problem
simultaneously for all the parameters may not be practical.

Fortunately, if $s[n]$ is approximately stationary over short time intervals, it is
feasible to solve for the amplitude, frequency and phase parameters of $\hat{s}^k[n]$ in isolation
by approximating $s[n]$ over an analysis frame of length $2N_a + 1$ samples centered at
$n = kN_s$. The overlapping frames of speech data and the accompanying frames of
envelope data required for analysis are isolated from $s[n]$ and $\sigma[n]$ respectively using
frame segmenter blocks 103. The synthetic contribution $\hat{s}^k[n]$ may then be determined
by minimizing

$$E^k = \sum_{n=-N_a}^{N_a} w_a[n] \{s[n + kN_s] - \sigma[n + kN_s]\,\hat{s}^k[n]\}^2 \tag{14}$$

with respect to the amplitudes, frequencies and phases of $\hat{s}^k[n]$.

The analysis window $w_a[n]$ may be an arbitrary positive function, but is typically
a symmetric, tapered window which serves to force greater accuracy at the frame
center, where the contribution of $\hat{s}^k[n]$ to $\hat{s}[n]$ is dominant. One example is the
Hamming window, given by

$$w_a[n] = \begin{cases} .54 + .46\cos(n\pi/N_a), & |n| \le N_a \\ 0, & \text{otherwise.} \end{cases} \tag{15}$$

The analysis frame length may be a fixed quantity, but it is desirable in certain
applications to have this parameter adapt to the expected pitch of a given speaker.
For example, as discussed in U. S. Pat. No. 4,885,790, issued to R. J. McAulay et al.,
the analysis frame length may be set to 2.5 times the expected average pitch period of
the speaker to provide adequate frequency resolution. In order to ensure the accuracy
of $\hat{s}[n]$, it is necessary that $N_a \ge N_s$.
Defining $x[n]$ and $g[n]$ by

$$\begin{aligned} x[n] &\triangleq (w_a[n])^{1/2}\, s[n + kN_s] \\ g[n] &\triangleq (w_a[n])^{1/2}\, \sigma[n + kN_s], \end{aligned} \tag{16}$$

and making use of Equation 10, $E^k$ may be rewritten as

$$E^k = \sum_{n=-N_a}^{N_a} \Big\{ x[n] - g[n] \sum_{j=1}^{J} A_j \cos(\omega_j n + \phi_j) \Big\}^2, \tag{17}$$

where frame notation has been omitted to simplify the equations. Unfortunately,
without a priori knowledge of the frequency parameters, this minimization problem
is highly nonlinear and therefore very difficult to solve.

As an alternative, a slightly suboptimal but relatively efficient analysis-by-synthesis
algorithm may be employed to determine the parameters of each sinusoid
successively. This algorithm operates as follows: Suppose the parameters of $\ell - 1$
sinusoids have been determined previously, yielding the successive approximation to
$x[n]$,

$$\hat{x}_{\ell-1}[n] = g[n] \sum_{j=1}^{\ell-1} A_j \cos(\omega_j n + \phi_j), \tag{18}$$

and the successive error sequence

$$e_{\ell-1}[n] = x[n] - \hat{x}_{\ell-1}[n]. \tag{19}$$

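Given the initial conditions $\hat{x}_0[n] = 0$ and $e_0[n] = x[n]$, these sequences may be
updated recursively by

$$\begin{aligned} \hat{x}_\ell[n] &= \hat{x}_{\ell-1}[n] + g[n] A_\ell \cos(\omega_\ell n + \phi_\ell) \\ e_\ell[n] &= e_{\ell-1}[n] - g[n] A_\ell \cos(\omega_\ell n + \phi_\ell), \end{aligned} \tag{20}$$

for $\ell \ge 1$. The goal is then to minimize the squared successive error norm $E_\ell$, given
by

$$E_\ell = \sum_{n=-N_a}^{N_a} \{e_\ell[n]\}^2 = \sum_{n=-N_a}^{N_a} \{e_{\ell-1}[n] - g[n] A_\ell \cos(\omega_\ell n + \phi_\ell)\}^2 \tag{21}$$

in terms of $A_\ell$, $\omega_\ell$ and $\phi_\ell$.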
At this point it is still not feasible to solve simultaneously for the parameters due
to the embedded frequency and phase terms. However, assuming for the moment that
$\omega_\ell$ is fixed and recalling the trigonometric identity
$\cos(\alpha + \beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta$, the expression for $E_\ell$ becomes

$$E_\ell = \sum_{n=-N_a}^{N_a} \{e_{\ell-1}[n] - a_\ell\, g[n]\cos\omega_\ell n - b_\ell\, g[n]\sin\omega_\ell n\}^2, \tag{22}$$

where $a_\ell = A_\ell\cos\phi_\ell$ and $b_\ell = -A_\ell\sin\phi_\ell$. In this case the problem is clearly
in the form of a linear least-squares approximation
which when optimized in terms of $a_\ell$ and $b_\ell$ yields "normal equations" of the form

$$\begin{aligned} a_\ell\gamma_{11} + b_\ell\gamma_{12} &= \psi_1 \\ a_\ell\gamma_{12} + b_\ell\gamma_{22} &= \psi_2, \end{aligned} \tag{23}$$

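where

$$\begin{aligned}
\gamma_{11} &= \sum_{n=-N_a}^{N_a} g^2[n]\cos^2\omega_\ell n \\
\gamma_{12} &= \sum_{n=-N_a}^{N_a} g^2[n]\cos\omega_\ell n\,\sin\omega_\ell n \\
\gamma_{22} &= \sum_{n=-N_a}^{N_a} g^2[n]\sin^2\omega_\ell n \\
\psi_1 &= \sum_{n=-N_a}^{N_a} e_{\ell-1}[n]\,g[n]\cos\omega_\ell n \\
\psi_2 &= \sum_{n=-N_a}^{N_a} e_{\ell-1}[n]\,g[n]\sin\omega_\ell n.
\end{aligned} \tag{24}$$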
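Solving for $a_\ell$ and $b_\ell$ gives

$$\begin{aligned} a_\ell &= (\gamma_{22}\psi_1 - \gamma_{12}\psi_2)/\Delta \\ b_\ell &= (\gamma_{11}\psi_2 - \gamma_{12}\psi_1)/\Delta, \end{aligned} \tag{25}$$

where $\Delta = \gamma_{11}\gamma_{22} - \gamma_{12}^2$. By the Principle of Orthogonality, given $a_\ell$ and $b_\ell$, $E_\ell$ can
be expressed as

$$E_\ell = E_{\ell-1} - a_\ell\psi_1 - b_\ell\psi_2. \tag{26}$$

Having determined $a_\ell$ and $b_\ell$, $A_\ell$ and $\phi_\ell$ are then given by the relations

$$\begin{aligned} A_\ell &= (a_\ell^2 + b_\ell^2)^{1/2} \\ \phi_\ell &= -\tan^{-1}(b_\ell/a_\ell). \end{aligned} \tag{27}$$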

This establishes a method for determining the optimal amplitude and phase parameters
for a single sinusoidal component of $\hat{s}^k[n]$ at a given frequency. To determine
an appropriate frequency for this sinusoid, an ensemble search procedure may be
employed. While a variety of search strategies are possible, the most straightforward
is an "exhaustive search." In this procedure, $\omega_\ell$ is varied over a set of uniformly spaced
candidate frequencies given by $\omega_c[i] = 2\pi i/M$ for $0 \le i \le M/2$ (assuming that $M$ is
an even number). For each $\omega_c[i]$, the corresponding value of $E_\ell$ is calculated using
Equation 26, and $\omega_\ell$ is chosen as that value of $\omega_c[i]$ which yields the minimum error.
$A_\ell$ and $\phi_\ell$ are then chosen as the amplitude and phase parameters associated with
that frequency value.
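
As a minimal sketch of the fixed-frequency solution (Equations 23 through 27) combined with the exhaustive search, the following Python fragment evaluates the inner products directly rather than by FFT. The function names, the list layout (index 0 holds the sample for $n = -N_a$), and the handling of a singular determinant at frequencies of 0 and $\pi$ are assumptions of this sketch and are not taken from the specification.

```python
import math

def fit_at_frequency(e, g, omega, na):
    """Solve the 2x2 normal equations at one candidate frequency.

    e and g hold samples for n = -na..na at list indices 0..2*na.
    Returns (amplitude, phase, error reduction E_{l-1} - E_l).
    """
    g11 = g12 = g22 = p1 = p2 = 0.0
    for idx in range(2 * na + 1):
        n = idx - na
        c, s = math.cos(omega * n), math.sin(omega * n)
        g2 = g[idx] * g[idx]
        g11 += g2 * c * c
        g12 += g2 * c * s
        g22 += g2 * s * s
        p1 += e[idx] * g[idx] * c
        p2 += e[idx] * g[idx] * s
    det = g11 * g22 - g12 * g12
    if abs(det) < 1e-12:
        # Assumed fallback: at omega = 0 or pi the sine term vanishes, so fit cosine only.
        a, b = (p1 / g11 if g11 > 0 else 0.0), 0.0
    else:
        a = (g22 * p1 - g12 * p2) / det
        b = (g11 * p2 - g12 * p1) / det
    reduction = a * p1 + b * p2          # from Equation 26
    amp = math.hypot(a, b)               # Equation 27
    phase = -math.atan2(b, a)            # four-quadrant form of -arctan(b/a)
    return amp, phase, reduction

def best_component(e, g, na, m):
    """Exhaustive search over omega_c[i] = 2*pi*i/m for 0 <= i <= m/2."""
    best = None
    for i in range(m // 2 + 1):
        omega = 2.0 * math.pi * i / m
        amp, phase, red = fit_at_frequency(e, g, omega, na)
        if best is None or red > best[3]:
            best = (omega, amp, phase, red)
    return best
```

Minimizing $E_\ell$ in Equation 26 is equivalent to maximizing the reduction $a_\ell\psi_1 + b_\ell\psi_2$, which is why the search keeps the candidate with the largest reduction. The four-quadrant arctangent is used so that the recovered phase is correct when $a_\ell$ is negative.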

In order to guarantee that $\hat{x}_\ell[n]$ converges to $x[n]$, it is necessary that $M > 2N_a$;
furthermore, in order to guarantee a level of accuracy which is independent of the
analysis frame length, $M$ should be proportional to $N_a$, i.e.

$$M = \nu N_a,$$

where $\nu$ is typically greater than six. Finally, to facilitate computation it is often
desirable to restrict $M$ to be an integer power of two. For example, given the above
conditions a suitable value of $M$ for the case when $N_a = 80$ would be $M = 512$.
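
The selection of $M$ can be illustrated with a short Python sketch. Treating $\nu$ as a lower bound on $M/N_a$ and rounding up to the next power of two is one reading of the conditions above; the function name `dft_size` and the default of six are assumptions of this sketch.

```python
def dft_size(na, nu=6):
    """Smallest power of two that is at least nu*na and also exceeds 2*na."""
    target = max(nu * na, 2 * na + 1)
    m = 1
    while m < target:
        m *= 2
    return m
```

With $N_a = 80$ the target is 480 and the next power of two is 512, in agreement with the example in the text.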

Having determined the parameters of the $\ell$-th component, the successive approximation
and error sequences are updated by Equation 20, and the procedure is repeated
for the next component. The number of components, $J[k]$, may be fixed or may be
determined in the analysis procedure according to various "closeness of fit" criteria
well known in the art. Figure 4 shows a functional block diagram of the analysis
procedure just described, illustrating its iterative, "closed-loop" structure.
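
The closed-loop structure described above may be sketched in Python as a loop that selects one component per pass and subtracts it from the running error sequence. This is a direct-summation sketch with a fixed number of components; the function name `analyze_frame`, the list layout (index 0 holds $n = -N_a$), and the singular-determinant fallback are assumptions, and no FFT acceleration or stopping criterion is attempted.

```python
import math

def analyze_frame(x, g, na, m, num_components):
    """Successive approximation over one analysis frame.

    x and g hold samples for n = -na..na. Returns the list of
    (amplitude, omega, phase) tuples and the final error sequence.
    """
    ns = range(-na, na + 1)
    e = list(x)                                   # e_0[n] = x[n]
    params = []
    for _ in range(num_components):
        best = None
        for i in range(m // 2 + 1):               # candidate omega_c[i] = 2*pi*i/m
            w = 2.0 * math.pi * i / m
            cs = [(math.cos(w * n), math.sin(w * n)) for n in ns]
            g11 = sum(gv * gv * c * c for gv, (c, s) in zip(g, cs))
            g12 = sum(gv * gv * c * s for gv, (c, s) in zip(g, cs))
            g22 = sum(gv * gv * s * s for gv, (c, s) in zip(g, cs))
            p1 = sum(ev * gv * c for ev, gv, (c, s) in zip(e, g, cs))
            p2 = sum(ev * gv * s for ev, gv, (c, s) in zip(e, g, cs))
            det = g11 * g22 - g12 * g12
            if abs(det) < 1e-12:                  # assumed fallback at omega = 0 or pi
                a, b = (p1 / g11 if g11 > 0 else 0.0), 0.0
            else:
                a, b = (g22 * p1 - g12 * p2) / det, (g11 * p2 - g12 * p1) / det
            drop = a * p1 + b * p2                # error reduction, Equation 26
            if best is None or drop > best[0]:
                best = (drop, w, a, b)
        _, w, a, b = best
        amp, phase = math.hypot(a, b), -math.atan2(b, a)
        params.append((amp, w, phase))
        # Equation 20: remove the selected component from the error sequence.
        e = [ev - gv * amp * math.cos(w * n + phase) for ev, gv, n in zip(e, g, ns)]
    return params, e
```

Because each pass removes the best single sinusoid from the residual, components tend to be returned in order of decreasing amplitude, which is the behavior the following discussion of clustering relies upon.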

Due to a natural high-frequency attenuation in the vocal tract referred to as
"spectral tilt," speech signals often have energy concentrated in the low-frequency
range. This phenomenon, combined with the tendency of analysis-by-synthesis to
select components in order of decreasing amplitude and with the fact that slight
mismatches exist between speech signals and their sinusoidal representations, implies
that analysis-by-synthesis tends to first choose high-amplitude components at low
frequencies, then smaller sinusoids immediately adjacent in frequency to the more
significant components. This "clustering" behavior slows the analysis algorithm by
making more iterations necessary to capture perceptually important high-frequency
information in speech. Furthermore, low-amplitude components clustered about
high-amplitude components are perceptually irrelevant, since they are "masked" by the
larger sinusoids. As a result, expending extra analysis effort to determine them is
wasteful.

Two approaches have been considered for dealing with the effects of clustering.
First, since clustering arises primarily because high-frequency components in
speech have small amplitudes relative to low-frequency components, one solution is
to apply a high-pass filter to $s[n]$ before analysis to make high-frequency components
comparable in amplitude to low-frequency components. In order to be effective, the
high-pass filter should approximately achieve a 6 dB/octave gain, although this is not
critical. One simple filter which works well is defined by

$$s_{pf}[n] = s[n] - .9s[n-1]. \tag{28}$$

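Since in this approach the "prefiltered" signal $s_{pf}[n]$ is modeled instead of $s[n]$,
the effects of prefiltering must be removed before producing synthetic speech. This
may be done either by applying the inverse of the filter given by Equation 28 to $\hat{s}[n]$,
or by removing the effects from the model parameters directly, using the formulas

$$\begin{aligned} A_j^k &= A_j^{\prime k} / |G(e^{j\omega_j^k})| \\ \phi_j^k &= \phi_j^{\prime k} - \angle G(e^{j\omega_j^k}), \end{aligned} \tag{29}$$

where

$$G(e^{j\omega}) = 1 - .9e^{-j\omega}$$

and the primed quantities denote the parameters obtained by analysis of the prefiltered signal.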
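
The parameter correction of Equation 29 can be illustrated with a brief Python sketch. It assumes the prefilter of Equation 28 with coefficient .9; the function name `undo_prefilter` and its argument names are hypothetical.

```python
import cmath

def undo_prefilter(amp_pf, phase_pf, omega, coeff=0.9):
    """Map parameters fitted to s_pf[n] = s[n] - coeff*s[n-1] back to those of s[n].

    amp_pf and phase_pf are the amplitude and phase found for the prefiltered signal
    at radian frequency omega.
    """
    g = 1.0 - coeff * cmath.exp(-1j * omega)      # G(e^{j omega})
    return amp_pf / abs(g), phase_pf - cmath.phase(g)
```

The correction follows from the fact that a sinusoid passed through the prefilter is scaled by $|G|$ and shifted in phase by $\angle G$ at its own frequency, so dividing by the former and subtracting the latter restores the original parameters.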

A second approach to the problem of clustering is based on the observation that
low-amplitude sinusoids tend to cluster around a high-amplitude sinusoid only in
the frequency range corresponding to the main lobe bandwidth of $W_a(e^{j\omega})$, the
frequency spectrum of $w_a[n]$. Thus, given a component with frequency $\omega_\ell$ determined by
analysis-by-synthesis, it may be assumed that no perceptually important components
lie in the frequency range

$$\omega_\ell - B_{ml}/2 \le \omega \le \omega_\ell + B_{ml}/2, \tag{30}$$

where $B_{ml}$ is the main lobe bandwidth of $W_a(e^{j\omega})$. The frequency domain characteristics
of a number of tapered windows are discussed by A. V. Oppenheim and R. W.
Schafer in Discrete-Time Signal Processing, Englewood Cliffs, New Jersey: Prentice-Hall,
1989, pp. 447-449. Therefore, the proposed analysis-by-synthesis algorithm may
be modified such that once a component with frequency $\omega_\ell$ has been determined,
frequencies in the range given by Equation 30 are eliminated from the ensemble search
thereafter, which ensures that clustering will not occur.

The amount of computation required to perform analysis-by-synthesis is reduced
greatly by recognizing that many of the required calculations may be performed using
a Fast Fourier Transform (FFT) algorithm. The M-point discrete Fourier transform
(DFT) of an M-point sequence x[n] is defined by

$X[m] = \sum_{n=0}^{M-1} x[n]\,W_M^{mn}, \quad 0 \le m < M$, (31)

where

$W_M = e^{-j2\pi/M}$. (32)

When x[n] is a real-valued sequence the following identities hold:

$\sum_{n=0}^{M-1} x[n]\cos((2\pi/M)mn) = \mathrm{Re}\{X[m]\}$

$\sum_{n=0}^{M-1} x[n]\sin((2\pi/M)mn) = -\mathrm{Im}\{X[m]\}$. (33)

For the purposes of analysis-by-synthesis the M-point DFT's of $e_{\ell-1}[n]g[n]$ and $g^2[n]$
are written as

$EG_{\ell-1}[m] = \sum_{n=-N_a}^{N_a} e_{\ell-1}[n]\,g[n]\,W_M^{mn}$

$GG[m] = \sum_{n=-N_a}^{N_a} g^2[n]\,W_M^{mn}$. (34)

Noting that $W_M^{m(n+M)} = W_M^{mn}$, these DFTs may be cast in the form of Equation 31
(provided that $M > 2N_a$) by adding M to the negative summation index values and
zero-padding the unused index values.
Consider now the inner product expressions which must be calculated in the
analysis-by-synthesis algorithm. From Equation 24, for the case of $\omega_\ell = \omega_c[i] = 2\pi i/M$,
$\gamma_{11}$ is given by

$\gamma_{11} = \sum_{n=-N_a}^{N_a} g^2[n]\cos^2((2\pi/M)in)$. (35)

Using Equation 33 and recalling that $\cos^2\theta = \tfrac12 + \tfrac12\cos 2\theta$, this becomes

$\gamma_{11} = \tfrac12 GG[0] + \tfrac12\mathrm{Re}\{GG[2i]\}$. (36)

Similarly, expressions for $\gamma_{12}$ and $\gamma_{22}$ can also be derived:

$\gamma_{12} = -\tfrac12\mathrm{Im}\{GG[2i]\}$

$\gamma_{22} = \tfrac12 GG[0] - \tfrac12\mathrm{Re}\{GG[2i]\}$. (37)

The first three parameters may therefore be determined from the stored values of
a single DFT which need only be calculated once per analysis frame using an FFT
algorithm, provided that M is a highly composite number. Furthermore, if M is an
integer power of 2, then the particularly efficient "radix-2" FFT algorithm may be
used. A variety of FFT algorithms are described by A. V. Oppenheim and R. W.
Schafer in Discrete-Time Signal Processing, Englewood Cliffs, New Jersey: Prentice-
Hall, 1989.
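
As a numerical check on Equations 35 through 37, the following sketch computes the three inner products from a single zero-padded FFT and can be compared against the direct sums. It assumes that the cross term is the sum of $g^2[n]$ times cosine times sine, which is what Equation 37 implies; the helper names are hypothetical.

```python
import numpy as np

def wrapped_dft(seq, n_a, m_len):
    """M-point DFT of seq[n], n = -n_a..n_a, with negative n stored at n + M."""
    seq = np.asarray(seq, dtype=float)
    if len(seq) != 2 * n_a + 1 or m_len <= 2 * n_a:
        raise ValueError("need len(seq) == 2*n_a + 1 and M > 2*n_a")
    buf = np.zeros(m_len)
    n = np.arange(-n_a, n_a + 1)
    buf[n % m_len] = seq          # wrap negative indices, zero-pad the rest
    return np.fft.fft(buf)        # numpy uses exp(-j*2*pi*m*n/M), as in Eq. 31

def gammas_from_dft(gg, i):
    """Return (gamma_11, gamma_12, gamma_22) for candidate bin i from GG[m]."""
    g0 = gg[0].real
    g2 = gg[(2 * i) % len(gg)]
    return 0.5 * g0 + 0.5 * g2.real, -0.5 * g2.imag, 0.5 * g0 - 0.5 * g2.real
```

Because GG[m] is computed once per frame, evaluating the three quantities for every candidate bin costs only a table lookup.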

Similar expressions for $\psi_1$ and $\psi_2$ can be derived directly from the DFT identities
given above:

$\psi_1 = \mathrm{Re}\{EG_{\ell-1}[i]\}$ (38)

and

$\psi_2 = -\mathrm{Im}\{EG_{\ell-1}[i]\}$. (39)

These parameters may thus be expressed in terms of the stored values of $EG_{\ell-1}[m]$.
However, since $e_{\ell-1}[n]$ changes for each new component added to the approximation,
$EG_{\ell-1}[m]$ must be computed J[k] times per frame. In order to reduce the amount of
computation further, the identities described above may be used to update this DFT
sequence.

According to Equation 20, the updated error sequence after the $\ell$-th component,
$e_\ell[n]$, is given by

$e_\ell[n] = e_{\ell-1}[n] - A_\ell\,g[n]\cos(\omega_\ell n + \phi_\ell)$. (40)

From this it is clear that the updated DFT $EG_\ell[m]$ is then

$EG_\ell[m] = EG_{\ell-1}[m] - A_\ell \sum_{n=-N_a}^{N_a} g^2[n]\left(\tfrac12 e^{j(\omega_\ell n + \phi_\ell)} + \tfrac12 e^{-j(\omega_\ell n + \phi_\ell)}\right) W_M^{mn}$. (41)

Recalling that $\omega_\ell = 2\pi i_\ell/M$, this becomes

$EG_\ell[m] = EG_{\ell-1}[m] - \tfrac12 A_\ell e^{j\phi_\ell}\,GG[((m - i_\ell))_M] - \tfrac12 A_\ell e^{-j\phi_\ell}\,GG[((m + i_\ell))_M]$, (42)

where $((\cdot))_M$ denotes the "modulo M" operator. $EG_\ell[m]$ can therefore be expressed
as a simple linear combination of $EG_{\ell-1}[m]$ and circularly shifted versions of GG[m];
this establishes a fast method of analysis-by-synthesis which operates in the frequency
domain. A flowchart of this algorithm is given in Figures 5 and 6.
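
The frequency-domain update can be sketched and checked against a direct time-domain recomputation as follows. The sketch assumes the update takes the form of the previous error DFT minus two circularly shifted copies of GG[m], each scaled by half the amplitude and a phase factor; the function names are hypothetical.

```python
import numpy as np

def wrapped_dft(seq, n_a, m_len):
    """M-point DFT of seq[n], n = -n_a..n_a, with negative n stored at n + M."""
    buf = np.zeros(m_len, dtype=complex)
    n = np.arange(-n_a, n_a + 1)
    buf[n % m_len] = seq
    return np.fft.fft(buf)

def update_error_dft(eg_prev, gg, amp, phase, i):
    """Update EG[m] after removing a component at bin i (w = 2*pi*i/M).

    eg_prev : DFT of e_prev[n]*g[n]
    gg      : DFT of g[n]**2
    """
    m_len = len(gg)
    m = np.arange(m_len)
    return (eg_prev
            - 0.5 * amp * np.exp(1j * phase) * gg[(m - i) % m_len]
            - 0.5 * amp * np.exp(-1j * phase) * gg[(m + i) % m_len])
```

Each update costs on the order of M complex multiply-adds, compared with on the order of M log M for recomputing the FFT of the new error sequence.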

It should be apparent to those skilled in the art that there are occasions when
$EG_\ell[m]$ will be a useful quantity in and of itself. For instance, if the goal of analyzing
a signal made up of sinusoidal components plus noise is to determine the Fourier
transform of the noise term, then $EG_\ell[m]$ corresponds to this quantity after removing
the sinusoidal signal components.

Recalling that $e_0[n] = x[n]$, then according to Equation 34,

$EG_0[m] = XG[m] = \sum_{n=-N_a}^{N_a} x[n]\,g[n]\,W_M^{mn}$. (43)

Substituting the definitions of x[n] and g[n] from Equation 16, XG[m] and GG[m]
may be written as

$XG[m] = \sum_{n=-N_a}^{N_a} w_a[n]\,s[n + kN_s]\,\sigma[n + kN_s]\,W_M^{mn}$

$GG[m] = \sum_{n=-N_a}^{N_a} w_a[n]\,\sigma^2[n + kN_s]\,W_M^{mn}$; (44)

that is, XG[m] and GG[m], the two functions required for fast analysis-by-synthesis,
are the zero-padded M-point DFT's of the sequences x[n]g[n] and $g^2[n]$, respectively.
This first sequence is the product of the speech data frame and the envelope data
frame multiplied by the analysis window function $w_a[n]$; likewise, $g^2[n]$ is simply the
square of the envelope data frame multiplied by $w_a[n]$.

Referring to Figure 1, multiplier block 104 responds to a frame of speech data
received via path 123 and a frame of envelope data received via path 122 to produce
the product of the data frames. Analysis window block 106 multiplies the output of
multiplier block 104 by the analysis window function $w_a[n]$, producing the sequence
x[n]g[n] described above. Squarer block 105 responds to a frame of envelope data
to produce the square of the data frame; the resulting output is input to a second
analysis window block to produce the sequence $g^2[n]$. At this point x[n]g[n] and $g^2[n]$
are input to parallel Fast Fourier Transform blocks 107, which yield the M-point
DFT's XG[m] and GG[m], respectively. Analysis-by-synthesis block 108 responds to
the input DFT's XG[m] and GG[m] to produce sinusoidal model parameters which
approximate the speech data frame, using the fast analysis-by-synthesis algorithm
discussed above. The resulting parameters are the amplitudes $\{A_\ell^k\}$, frequencies $\{\omega_\ell^k\}$
and phases $\{\phi_\ell^k\}$ which produce $s^k[n]$, as shown in Equation 10.

System estimator 110 responds to a frame of speech data transmitted via path
123 to produce coefficients representative of $H(e^{j\omega})$, an estimate of the frequency
response of the human vocal tract. Algorithms to determine these coefficients include
linear predictive analysis, as discussed in U. S. Pat. No. 3,740,476, issued to B. S.
Atal, and homomorphic analysis, as discussed in U. S. Pat. No. 4,885,790, issued to
R. J. McAulay et al. System estimator 110 then transmits said coefficients via path
124 to parameter encoder 112 for subsequent transmission to storage element 113.
In order to perform speech modifications using a sinusoidal model it is necessary
for the frequency parameters associated with a given speech data frame to reflect
the pitch information embedded in the frame. To this end, $s^k[n]$ may be written in
quasi-harmonic form:

$s^k[n] = \sum_{j=0}^{J[k]} A_j^k \cos((j\omega_0^k + \Delta_j^k)n + \phi_j^k)$, (45)

where $\omega_j^k = j\omega_0^k + \Delta_j^k$, and where J[k] is now the greatest integer such that $J[k]\,\omega_0^k \le \pi$.
Note that only one component is associated with each harmonic number j. With this
formulation, the fundamental frequency $\omega_0^k = 2\pi f_0^k/F_s$ must now be determined.

Fundamental frequency estimator 109 responds to the analyzed model parameter
set from analysis-by-synthesis block 108 and to vocal tract frequency response
coefficients received via path 124 to produce an estimate of the fundamental frequency $\omega_0^k$
of Equation 45. While many approaches to fundamental frequency estimation may
be employed, a novel algorithm which makes use of the analyzed sinusoidal model
parameters in a fashion similar to the algorithm disclosed by McAulay and Quatieri
in "Pitch Estimation and Voicing Detection Based on a Sinusoidal Speech Model,"
Proc. IEEE Int'l Conf. on Acoust., Speech and Signal Processing, pp. 249-252, April
1990, is described here: If $\omega_0^k$ is defined as that value of $\omega$ which minimizes the error
induced by quantizing the frequency parameters to harmonic values,



15 



12 
(46) 
. 



n=-N, (j=0 

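then $\omega_0^k$ is approximately equal to

$\omega_0^k \approx \dfrac{\sum_{i=0}^{J[k]} i\,(A_i^k)^2\,\omega_i^k}{\sum_{i=0}^{J[k]} i^2 (A_i^k)^2}$, (47)

assuming that $N_a$ is on the order of a pitch period or larger. This estimate is simply
the average of $\{\omega_i^k/i\}$ weighted by $(iA_i^k)^2$.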
Again suppressing frame notation, given an initial fundamental frequency
estimate $\omega'_0 = 2\pi f'_0/F_s$, it is possible to arrange a subset of the analyzed sinusoidal
model parameters in the quasi-harmonic form of Equation 45 and to update the
fundamental frequency estimate recursively. This is accomplished by passing through
the frequency parameters in order of decreasing amplitude and calculating each
frequency's harmonic number, defined as $\langle\omega_i/\omega_0\rangle$. If this equals the harmonic number
of any previous component, the component is assigned to the set of parameters $\mathcal{E}$
which are excluded from the quasi-harmonic representation; otherwise, the
component is included in the quasi-harmonic set, and its parameters are used to update $\omega_0$
according to Equation 47. Any harmonic numbers left unassigned are associated with
zero-amplitude sinusoids at appropriate multiples of the final value of $\omega_0$.
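
The recursive refinement just described can be sketched in Python. The sketch assumes that the harmonic number is obtained by rounding to the nearest integer and that the running estimate is the average of the frequency-to-harmonic-number ratios weighted by the squared product of harmonic number and amplitude, as stated for Equation 47; the function name and return layout are hypothetical.

```python
import numpy as np

def refine_fundamental(amps, freqs, w0_init):
    """Assign components to harmonics in order of decreasing amplitude.

    Returns (w0, assigned, excluded), where assigned maps harmonic number
    to component index and excluded lists the indices left out.
    """
    amps = np.asarray(amps, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    w0 = float(w0_init)
    assigned, excluded = {}, []
    num = den = 0.0
    for idx in np.argsort(-amps, kind="stable"):
        h = int(np.floor(freqs[idx] / w0 + 0.5))   # nearest harmonic number
        if h in assigned:                          # harmonic already taken
            excluded.append(int(idx))
            continue
        assigned[h] = int(idx)
        num += h * amps[idx] ** 2 * freqs[idx]
        den += (h * amps[idx]) ** 2
        if den > 0.0:                              # h == 0 carries no pitch information
            w0 = num / den
    return w0, assigned, excluded
```

Because the strongest components are visited first, the estimate is anchored by the most reliable frequencies before weaker, possibly spurious, components are considered.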

In the case of speech signals, the above algorithm must be refined, since a reliable
initial estimate is usually not available. The following procedure is used to define
and choose from a set of candidate fundamental frequency estimates: Since, in
conditions of low-energy, wideband interference, high-amplitude components correspond to






signal components, it may be assumed that the frequency f of the highest amplitude
component whose frequency is in the range from 100 to 1000 Hz is approximately
some multiple of the actual pitch frequency, i.e. $f_0 \approx f/i$ for some i.

In order to determine an appropriate value of i, a set of values of i are determined
such that f/i falls in the range from 40 to 400 Hz, the typical pitch frequency range for
human speech. For each i in this set the recursive fundamental frequency estimation
algorithm is performed as described above, using an initial frequency estimate of
$\omega'_0[i] = 2\pi f'_0[i]/F_s$, where $f'_0[i] = f/i$. Given the resulting refined estimate, a
measure of the error power induced over the speech data frame by fixing the
quasi-harmonic frequencies to harmonic values may be derived, yielding

N 2 ( J J \ 

p f = it £(^> 2 - <*M E A >> ■ ( 48 ) 

Due to the inherent ambiguity of fundamental frequency estimates, a second error
measure is necessary to accurately resolve which candidate is most appropriate. This
second quantity is a measure of the error power induced by independently organizing
the parameters in quasi-harmonic form and quantizing the amplitude parameters to
an optimal constant multiple of the vocal tract spectral magnitude at the component
frequencies, given by

20 

'-l + 5 (49 > 

where $P_e$ is the power of the parameter set $\mathcal{E}$ excluded from the quasi-harmonic
representation,

25 

P. = 5 £ 4 (50) 
and where r 

30 K = i>|ff(e*")|, (51) 

*=o 

V* = £|ff(e**)| 2 - < 52 > 

At this point a composite error function $P_T[i]$ is constructed as $P_T[i] = P_f + P_a$,
and the refined estimate $\omega_0[i]$ corresponding to the minimum value of $P_T[i]$ is chosen
as the final estimate $\omega_0$. This algorithm is illustrated in flowchart form by Figures 7
through 9. In the case where interference is sufficiently strong or narrowband that
the analyzed component at frequency f cannot be assumed to be a signal component,
then the algorithm described above may still be employed, using a predefined set of
candidate frequencies which are independent of the analyzed parameters.
Fundamental frequency estimator 109 then transmits $\omega_0$ via path 125 to parameter encoder 112
for subsequent transmission to storage element 113.

Harmonic assignment block 111 responds to the fundamental frequency estimate
$\omega_0$ and the model parameters determined by analysis-by-synthesis to produce a
quasi-harmonic parameter set as in Equation 45. This is accomplished by assigning each
successive component a harmonic number given by $\langle\omega_j/\omega_0\rangle$ in order of decreasing
amplitude, refraining from assigning components whose harmonic numbers conflict with
those of previously assigned components. The resulting parameter set thus includes
as many high-amplitude components as possible in the quasi-harmonic parameter
set. The harmonic assignment algorithm is illustrated in flowchart form by Figure
10. Harmonic assignment block 111 then transmits the quasi-harmonic model
amplitudes $\{A_j^k\}$, differential frequencies $\{\Delta_j^k\}$ and phases $\{\phi_j^k\}$ via paths 126, 127 and
128 respectively, to parameter encoder 112 for subsequent transmission to storage
element 113.

While the time-varying gain sequence $\sigma[n]$ acts to increase model accuracy during
transition regions of speech signals and improves the performance of analysis in these
regions, it is not absolutely required for the model to function, and the additional
computation required to estimate $\sigma[n]$ may outweigh the performance improvements
for certain applications. Therefore, a second version of a speech analyzer which
operates without said time-varying gain (equivalent to assuming that $\sigma[n] = 1$) is
illustrated in Figure 11.

Speech analyzer 1100 operates identically to speech analyzer 100 with the following
exceptions: The signal path dedicated to calculating, transmitting and framing $\sigma[n]$ is
eliminated, along with the functional blocks associated therewith. A second difference
is seen by considering the formulas giving DFT's XG[m] and GG[m] in Equation 44
for the case when $\sigma[n] = 1$:

$XG[m] = \sum_{n=-N_a}^{N_a} w_a[n]\,s[n + kN_s]\,W_M^{mn}$

$GG[m] = \sum_{n=-N_a}^{N_a} w_a[n]\,W_M^{mn}$. (53)

That is, XG[m] is now the DFT of the speech data frame multiplied by the analysis
window, and GG[m] is simply the DFT of the analysis window function, which may
be calculated once and used as a fixed function thereafter.

Analysis window block 1103 responds to a frame of speech data received via path
1121 to multiply said data frame by the analysis window function $w_a[n]$ to produce
the sequence x[n]g[n]. Fast Fourier Transform block 1105 responds to x[n]g[n] to
produce the M-point DFT XG[m] defined above. Read-only memory block 1104
serves to store the precalculated DFT GG[m] defined above and to provide this DFT
to analysis-by-synthesis block 1106 as needed. All other algorithmic components of
speech analyzer 1100 and their structural relationships are identical to those of speech
analyzer 100.

Figure 12 illustrates an analyzer embodiment of the present invention appropriate
for the analysis of pitched musical tone signals. Musical tone analyzer 1200 of
Figure 12 responds to analog musical tone signals in order to determine sinusoidal
model parameters in a fashion similar to speech analyzer 100. Musical tone analyzer
1200 digitizes and quantizes analog musical signals received via path 1220 using A/D
converter 1201 in the same manner as A/D converter 101.

Time-varying gain calculator 1202 responds to the data stream produced by A/D
converter 1201 to produce an envelope sequence $\sigma[n]$ as described in speech analyzer
100. The same filtering operation of Equation 2 is used; however, the filter parameters
$\lambda$ and $n_\sigma$ are varied as a function of the nominal expected pitch frequency of the tone,
$\omega'_0$, received via path 1221 according to the relation

where $\xi = 2 - \cos\omega'_0$, and $n_\sigma$ is calculated using Equation 6. The purpose of this
variation is to adjust the filter's selectivity to the expected pitch in order to optimize
performance. Time-varying gain calculator 1202 transmits $\sigma[n]$ via path 1222 to
parameter encoder 1210 for subsequent transmission to storage element 1211.

Overlapping frames of musical signal data and the accompanying frames of
envelope data required for analysis are isolated from s[n] and $\sigma[n]$ respectively using
frame segmenter blocks 1203 in the same manner as in speech analyzer 100.
Multiplier block 1204 responds to a musical signal data frame received via path 1223
and an envelope data frame received via path 1224 to produce the product of the
data frames. Analysis window block 1206 multiplies the output of multiplier block
1204 by the analysis window function described in speech analyzer 100, producing
the product of the sequences x[n] and g[n] defined by Equation 16. Squarer block
1205 responds to a frame of envelope data to produce the square of the envelope data
frame; the resulting output is input to a second analysis window block to produce
the sequence $g^2[n]$. At this point x[n]g[n] and $g^2[n]$ are input to parallel Fast Fourier
Transform blocks 1207, which yield the M-point DFT's XG[m] and GG[m] defined
in Equation 44, respectively.

Harmonically-constrained analysis-by-synthesis block 1208 responds to the input
DFT's XG[m] and GG[m] and to $\omega'_0$ to produce sinusoidal model parameters which
approximate the musical signal data frame. These parameters produce $s^k[n]$ using
the quasi-harmonic representation shown in Equation 45. The analysis algorithm
used is identical to the fast analysis-by-synthesis algorithm discussed in the
description of speech analyzer 100, with the following exception: Since an unambiguous
initial fundamental frequency estimate is available, as each candidate frequency $\omega_c[i]$
is tested to determine the $\ell$-th component of x[n], its harmonic number is calculated
as $\langle\omega_c[i]/\omega'_0\rangle$. If this equals the harmonic number of any of the previous $\ell - 1$
components, the candidate is disqualified, ensuring that only one component is associated
with each harmonic number. As each new component is determined, the estimate of
$\omega_0^k$ is updated according to Equation 47. This algorithm is illustrated in flowchart
form by Figures 13 through 15.
Harmonically-constrained analysis-by-synthesis block 1208 then transmits the
fundamental frequency estimate $\omega_0^k$ and the quasi-harmonic model amplitudes $\{A_j^k\}$,
differential frequencies $\{\Delta_j^k\}$ and phases $\{\phi_j^k\}$ via paths 1225, 1226, 1227 and 1228
respectively, to parameter encoder 1210 for subsequent transmission to storage element
1211. System estimator 1209 responds to a musical signal data frame transmitted via
path 1223 to produce coefficients representative of $H(e^{j\omega})$, an estimate of the spectral
envelope of the quasi-harmonic sinusoidal model parameters. The algorithms which
may be used to determine these coefficients are the same as those used in system
estimator 110. System estimator 1209 then transmits said coefficients via path 1229
to parameter encoder 1210 for subsequent transmission to storage element 1211.
As previously mentioned, the time-varying gain sequence $\sigma[n]$ is not required for
the model to function; therefore, a second version of a musical tone analyzer that
operates without said time-varying gain is illustrated in Figure 16. Musical tone
analyzer 1600 incorporates the same alterations as described in the discussion of
speech analyzer 1100. Furthermore, although the spectral envelope $H(e^{j\omega})$ is required
to perform pitch-scale modification of musical signals, when this type of modification
is not performed the spectral envelope is not required in musical tone analysis. In this
case, signal paths 1229 and 1620 and functional blocks 1209 and 1601 are omitted
from analyzers 1200 and 1600, respectively.

Figure 17 illustrates a synthesizer embodiment of the present invention appropriate
for the synthesis and modification of speech signals. Speech synthesizer 1700 of Figure
17 responds to stored encoded quasi-harmonic sinusoidal model parameters previously
determined by speech analysis in order to produce a synthetic facsimile of the original
analog signal or alternately synthetic speech advantageously modified in time- and/or
frequency-scale.

Parameter decoder 1702 responds to the stored encoded parameters transmitted
from storage element 1701 via path 1720 to yield the time-varying gain sequence $\sigma[n]$
of Equation 8 (if calculated in analysis), the coefficients associated with vocal tract
frequency response estimate $H(e^{j\omega})$ discussed in the description of speech analyzer
100, and the fundamental frequency estimate $\omega_0^k$, quasi-harmonic model amplitudes
$\{A_j^k\}$, differential frequencies $\{\Delta_j^k\}$ and phases $\{\phi_j^k\}$ used to generate a synthetic
contribution according to Equation 45. Although storage element 1701 is shown to
be distinct from storage element 113 of speech analyzer 100, it should be understood
that speech analyzer 100 and speech synthesizer 1700 may share the same storage
element.

Consider now the operation of speech synthesizer 1700 in greater detail. Referring
to Equations 12 and 45, time- and frequency-scale modification may be performed
on isolated synthesis frames, using different time and frequency scale factors in each
successive frame if desired. A simple approach to time-scale modification by a factor
$\rho_k$ using the overlap-add sinusoidal model is to change the length of synthesis frame k
from $N_s$ to $\rho_k N_s$ with corresponding time scaling of the envelope sequence $\sigma[n]$ and the
synthesis window $w_s[n]$. Frequency-scale modification by a factor $\beta_k$ is accomplished
by scaling the component frequencies of each synthetic contribution $s^k[n]$. In either
case, time shifts are introduced to the modified synthetic contributions to account for
changes in phase coherence due to the modifications.

Unfortunately, this simple approach yields modified speech with reverberant
artifacts as well as a noisy, "rough" quality. Examination of Equation 45 reveals why.
Since the differential frequencies $\{\Delta_j^k\}$ are nonzero and independent, they cause the
phase of each component sinusoid to evolve nonuniformly with respect to other
components. This "phase evolution" results in a breakdown of coherence in the model as
the time index deviates beyond analysis frame boundaries, as illustrated in Figures
18A and 18B. Time-shifting this extrapolated sequence therefore introduces
incoherence to the modified speech.

The present invention overcomes the problem of uncontrolled phase evolution by
altering the component frequencies of $s^k[n]$ in the presence of modifications according
to the relation

$\tilde\omega_j^k = j\beta_k\omega_0^k + \Delta_j^k/\rho_k$. (54)

This implies that as the time scale factor $\rho_k$ is increased, the component frequencies
"pull in" towards the harmonic frequencies, and in the limit the synthetic
contributions become purely periodic sequences. The effect is to slow phase evolution, so that
coherence breaks down proportionally farther from the analysis frame center to
account for the longer synthesis frame length. The behavior of a synthetic contribution
modified in this way is illustrated in Figures 19A and 19B.

Based on this new approach, a synthesis equation similar to Equation 12 may be
constructed:

$s[n + N_k] = \sigma[n/\rho_k + kN_s]\left\{ w_s[n/\rho_k]\,s^k_{\rho_k,\beta_k}[n] + w_s[n/\rho_k - N_s]\,s^{k+1}_{\rho_{k+1},\beta_{k+1}}[n - \rho_k N_s] \right\}$, (55)

for $0 \le n < \rho_k N_s$, where $N_k = N_s\sum_{i=0}^{k-1}\rho_i$ is the starting point of the modified
synthesis frame, and where

5 J=0 

«L>] = I) ^ l co<ji8H»«^ 1 (n + 0 + ^ + ^ ( 5g ) 

Techniques for determining the time shifts $\delta^k$ and $\delta^{k+1}$ will be discussed later. It
should be noted that when $\beta_k > 1$, it is possible for the component frequencies of
$s^k_{\rho_k,\beta_k}[n]$ to exceed $\pi$, resulting in "aliasing." For this reason it is necessary to set the
amplitude of any component whose modified frequency is greater than $\pi$ to zero.
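
The frequency alteration and the anti-aliasing rule can be sketched together. This is a minimal illustration which assumes the modified frequency of harmonic j is the scaled harmonic frequency plus the differential frequency divided by the time-scale factor; the function name and argument layout are hypothetical.

```python
import numpy as np

def modified_components(amps, deltas, w0, rho, beta):
    """Return modified frequencies and amplitudes for one synthesis frame.

    amps, deltas : quasi-harmonic amplitudes and differential frequencies,
                   indexed by harmonic number j = 0, 1, 2, ...
    w0           : fundamental frequency in radians per sample
    rho, beta    : time-scale and frequency-scale factors
    """
    amps = np.asarray(amps, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    j = np.arange(len(amps))
    freqs = j * beta * w0 + deltas / rho         # differential part shrinks as rho grows
    amps_out = np.where(freqs > np.pi, 0.0, amps)  # silence components that would alias
    return freqs, amps_out
```

As rho grows, the second term vanishes and the frequencies approach exact multiples of the scaled fundamental, which is the "pull in" behavior described above.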

Pitch onset time estimator 1703 responds to the coefficients representing $H(e^{j\omega})$
received via path 1721, the fundamental frequency estimate received via path 1722,
and the quasi-harmonic model amplitudes, differential frequencies and phases received
via paths 1723, 1724 and 1725 respectively in order to estimate the time relative to
the center of an analysis frame at which an excitation pulse occurs. This function
is achieved using an algorithm similar to one developed by McAulay and Quatieri in
"Phase Modelling and its Application to Sinusoidal Transform Coding," Proc. IEEE
Int'l Conf. on Acoust., Speech and Signal Processing, pp. 1713-1715, April 1986, and
based on the observation that the glottal excitation sequence (which is ideally a
periodic pulse train) may be expressed using the quasi-harmonic sinusoidal representation
of Equations 8 and 45, where the synthetic contributions $s^k[n]$ are replaced by

$e^k[n] = \sum_{\ell=0}^{J[k]} b_\ell^k \cos(\omega_\ell^k n + \theta_\ell^k)$, (57)

and where the amplitude and phase parameters of $e^k[n]$ are given by

$b_\ell^k = A_\ell^k / |H(e^{j\omega_\ell^k})|$

$\theta_\ell^k = \phi_\ell^k - \angle H(e^{j\omega_\ell^k})$. (58)

This process is referred to as "deconvolution." Assuming for simplicity that $\omega_\ell^k = \ell\omega_0^k$
and suppressing frame notation, Equation 57 may be rewritten as

$e[n] = \sum_{\ell=0}^{J} b_\ell \cos(\ell\omega_0(n - \tau_p) + \psi_\ell(\tau_p))$, (59)

where

$\psi_\ell(\tau_p) = \theta_\ell + \ell\omega_0\tau_p$. (60)

One of the properties of the vocal tract frequency response estimate $H(e^{j\omega})$ is
that the amplitude parameters $A_\ell^k$ are approximately proportional to the magnitude
of $H(e^{j\omega})$ at the corresponding frequencies $\omega_\ell^k$; thus, the deconvolved amplitude
parameters $\{b_\ell^k\}$ are approximately constant. If, in addition, the "time-shifted"
deconvolved phase parameters $\{\psi_\ell(\tau_p)\}$ are close to zero or $\pi$ for some value of $\tau_p$ (termed
"maximal coherence"), then $e^k[n]$ is approximately a periodic pulse train with a "pitch
onset time" of $\tau_p$. By assuming the condition of maximal coherence, an approximation
to $s^k[n]$ may be constructed by reversing the deconvolution process of Equation 58,
yielding

$\hat{s}^k[n] = \sum_{\ell=0}^{J[k]} A_\ell^k \cos(\ell\omega_0^k(n - \tau_p) + \angle H(e^{j\omega_\ell^k}) + m\pi)$, (61)

where m is either zero or one.

The pitch onset time parameter $\tau_p$ may then be defined as that value of $\tau$ which
yields the minimum mean-square error between $s^k[n]$ and $\hat{s}^k[n]$ over the original
analysis frame,

$E(\tau) = \sum_{n=-N_a}^{N_a} \left\{ s^k[n] - \sum_{\ell=0}^{J[k]} A_\ell^k \cos(\ell\omega_0^k(n - \tau) + \angle H(e^{j\omega_\ell^k}) + m\pi) \right\}^2$. (62)

Assuming that $N_a$ is a pitch period or more, this is approximately equivalent to
finding the absolute maximum of the pitch onset likelihood function

$L(\tau) = \sum_{\ell=0}^{J[k]} (A_\ell^k)^2 \cos(\psi_\ell(\tau))$ (63)

in terms of $\tau$. Unfortunately, this problem does not have a closed-form solution;
however, due to the form of $\psi_\ell(\tau)$, $L(\tau)$ is periodic with period $2\pi/\omega_0$. Therefore, the
pitch onset time may be estimated by evaluating $L(\tau)$ at a number (typically greater
than 128) of uniformly spaced points on the interval $[-\pi/\omega_0, \pi/\omega_0]$ and choosing $\tau_p$
to correspond to the maximum of $|L(\tau)|$. This algorithm is shown in flowchart form
in Figures 20 and 21.
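
The grid search over one pitch period can be sketched as follows. The sketch assumes exactly harmonic component frequencies and a likelihood equal to the sum of squared amplitudes times the cosine of the time-shifted deconvolved phases; the function name, the `harmonics` argument and the returned polarity flag m are illustrative choices.

```python
import numpy as np

def pitch_onset_time(amps, phases, h_phases, harmonics, w0, n_points=256):
    """Estimate the pitch onset time by maximizing |L(tau)| on a uniform grid.

    amps, phases : model amplitudes and phases of the frame
    h_phases     : phase of the vocal tract estimate at each component frequency
    harmonics    : harmonic number of each component
    w0           : fundamental frequency in radians per sample
    Returns (tau_p, m), where m is 0 if L(tau_p) >= 0 and 1 otherwise.
    """
    amps = np.asarray(amps, dtype=float)
    harmonics = np.asarray(harmonics, dtype=float)
    theta = np.asarray(phases, dtype=float) - np.asarray(h_phases, dtype=float)
    taus = np.linspace(-np.pi / w0, np.pi / w0, n_points, endpoint=False)
    psi = theta[None, :] + harmonics[None, :] * w0 * taus[:, None]
    like = (amps ** 2 * np.cos(psi)).sum(axis=1)
    best = int(np.argmax(np.abs(like)))
    return taus[best], (0 if like[best] >= 0.0 else 1)
```

The grid resolution bounds the accuracy of the estimate; with 256 points and a 100-sample pitch period the spacing is about 0.4 samples.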

DFT assignment block 1704 responds to the fundamental frequency received
via path 1722, the sets of quasi-harmonic model amplitudes, differential frequencies
and phases received via paths 1723, 1724 and 1725 respectively, pitch onset time
estimate $\tau_p^k$ received via path 1726, frequency-scale modification factor $\beta_k$ and
time-scale modification factor $\rho_k$ received via paths 1727 and 1728, respectively, to produce
a sequence Z[i] which may be used to construct a modified synthetic contribution
using an FFT algorithm.

Consider the operation of DFT assignment block 1704 in greater detail. Referring
to Equation 10, since the component frequencies of $s^k[n]$ are given by $\omega_\ell^k = 2\pi i_\ell/M$,
a synthetic contribution may be expressed as

$s^k[n] = \sum_{\ell=1}^{J[k]} A_\ell^k \cos(2\pi i_\ell n/M + \phi_\ell^k)$. (64)

Recognizing that $A_\ell^k \cos(2\pi i_\ell n/M + \phi_\ell^k) = \mathrm{Re}\{A_\ell^k e^{-j(2\pi i_\ell n/M + \phi_\ell^k)}\}$, this becomes

$s^k[n] = \mathrm{Re}\left\{ \sum_{\ell=1}^{J[k]} A_\ell^k e^{-j\phi_\ell^k}\,W_M^{i_\ell n} \right\}$. (65)

Thus, by Equation 31, any sequence expressed as a sum of constant-amplitude,
constant-frequency sinusoids whose frequencies are constrained to be multiples of
$2\pi/M$ is alternately given as the real part of the M-point DFT of a sequence Z[i]
with values of $A_\ell^k e^{-j\phi_\ell^k}$ at $i = i_\ell$ and zero otherwise. This DFT may be calculated
using an FFT algorithm.
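
A short sketch of this synthesis-by-FFT idea follows. It assumes the forward DFT convention with a negative exponent, which is also the convention of `numpy.fft.fft`, so that each bin holds the amplitude times a complex exponential of the negated phase; the function name is hypothetical.

```python
import numpy as np

def synth_frame_via_fft(amps, phases, bins, m_len):
    """Synthesize sum_l A_l*cos(2*pi*i_l*n/M + phi_l) for n = 0..M-1 via one FFT."""
    z = np.zeros(m_len, dtype=complex)
    vals = np.asarray(amps, dtype=float) * np.exp(-1j * np.asarray(phases, dtype=float))
    np.add.at(z, np.asarray(bins, dtype=int), vals)   # accumulate if bins repeat
    return np.fft.fft(z).real
```

One M-point FFT replaces the evaluation of every cosine at every sample, which is the source of the computational saving when the number of components is large.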

According to Equation 56, in the presence of time- and frequency-scale modifica- 
tion a synthetic contribution is given by 






j[k] 

^ A N = E^ C * + ^' (66) 

*=0 



where 



Except for the case when $\rho_k = \beta_k = 1$, the modified frequency terms no longer fall
at multiples of $2\pi/M$; however, an FFT algorithm may still be used to accurately
represent $s^k_{\rho_k,\beta_k}[n]$. Ignoring frame notation, this is accomplished by calculating the
DFT indices whose corresponding frequencies are adjacent to $\tilde\omega_\ell$:

$i_{1,\ell} = \lfloor \hat{M}\tilde\omega_\ell/2\pi \rfloor$ (69)

$i_{2,\ell} = i_{1,\ell} + 1$, (70)

where $\lfloor\cdot\rfloor$ denotes the "greatest integer less than or equal to" operator.

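The length of the DFT used in modification synthesis, $\hat{M}$, is adjusted to
compensate for the longer frame lengths required in time-scale modification and is typically
greater than or equal to $\rho_k M$. Each component of $s^k_{\rho_k,\beta_k}[n]$ is then approximated using
two components with frequencies $\hat\omega_{1,\ell} = 2\pi i_{1,\ell}/\hat{M}$ and $\hat\omega_{2,\ell} = 2\pi i_{2,\ell}/\hat{M}$ in the
following manner: Given a single sinusoidal component with an unconstrained frequency
$\tilde\omega_\ell$ of the form

$c_\ell[n] = A_\ell\cos(\tilde\omega_\ell n + \phi_\ell) = a_\ell\cos\tilde\omega_\ell n + b_\ell\sin\tilde\omega_\ell n$, (71)

two sinusoids with constrained frequencies are added together to form an
approximation to $c_\ell[n]$:

$\hat{c}_\ell[n] = A_{1,\ell}\cos(\hat\omega_{1,\ell}n + \phi_{1,\ell}) + A_{2,\ell}\cos(\hat\omega_{2,\ell}n + \phi_{2,\ell})$

$\quad = a_{1,\ell}\cos\hat\omega_{1,\ell}n + b_{1,\ell}\sin\hat\omega_{1,\ell}n + a_{2,\ell}\cos\hat\omega_{2,\ell}n + b_{2,\ell}\sin\hat\omega_{2,\ell}n$. (72)

Letting $\hat{N}_s = \rho_k N_s$ and using the squared error norm

$E_\ell = \sum_{n=-\hat{N}_s}^{\hat{N}_s} \{c_\ell[n] - \hat{c}_\ell[n]\}^2$, (73)

minimization of $E_\ell$ in terms of the coefficients of $\hat{c}_\ell[n]$ leads to the conditions

$\dfrac{\partial E_\ell}{\partial a_{1,\ell}} = \dfrac{\partial E_\ell}{\partial a_{2,\ell}} = \dfrac{\partial E_\ell}{\partial b_{1,\ell}} = \dfrac{\partial E_\ell}{\partial b_{2,\ell}} = 0$. (74)

Expanding the first condition using Equation 72 yields

$\sum_{n=-\hat{N}_s}^{\hat{N}_s} \hat{c}_\ell[n]\cos\hat\omega_{1,\ell}n = \sum_{n=-\hat{N}_s}^{\hat{N}_s} c_\ell[n]\cos\hat\omega_{1,\ell}n$. (75)

Equations 71 and 72 may be substituted into this equation; however, noting that

$\sum_{n=-N}^{N} \cos\alpha n\,\sin\beta n = 0$

for all $\alpha$, $\beta$ and N, the resulting expression simplifies to

$a_{1,\ell}\sum_{n=-\hat{N}_s}^{\hat{N}_s}\cos^2\hat\omega_{1,\ell}n + a_{2,\ell}\sum_{n=-\hat{N}_s}^{\hat{N}_s}\cos\hat\omega_{1,\ell}n\cos\hat\omega_{2,\ell}n = a_\ell\sum_{n=-\hat{N}_s}^{\hat{N}_s}\cos\tilde\omega_\ell n\cos\hat\omega_{1,\ell}n$. (76)

Similarly, the other conditions of Equation 74 are given by the equations

$a_{1,\ell}\sum_{n=-\hat{N}_s}^{\hat{N}_s}\cos\hat\omega_{1,\ell}n\cos\hat\omega_{2,\ell}n + a_{2,\ell}\sum_{n=-\hat{N}_s}^{\hat{N}_s}\cos^2\hat\omega_{2,\ell}n = a_\ell\sum_{n=-\hat{N}_s}^{\hat{N}_s}\cos\tilde\omega_\ell n\cos\hat\omega_{2,\ell}n$, (77)

$b_{1,\ell}\sum_{n=-\hat{N}_s}^{\hat{N}_s}\sin^2\hat\omega_{1,\ell}n + b_{2,\ell}\sum_{n=-\hat{N}_s}^{\hat{N}_s}\sin\hat\omega_{1,\ell}n\sin\hat\omega_{2,\ell}n = b_\ell\sum_{n=-\hat{N}_s}^{\hat{N}_s}\sin\tilde\omega_\ell n\sin\hat\omega_{1,\ell}n$, (78)

and

$b_{1,\ell}\sum_{n=-\hat{N}_s}^{\hat{N}_s}\sin\hat\omega_{1,\ell}n\sin\hat\omega_{2,\ell}n + b_{2,\ell}\sum_{n=-\hat{N}_s}^{\hat{N}_s}\sin^2\hat\omega_{2,\ell}n = b_\ell\sum_{n=-\hat{N}_s}^{\hat{N}_s}\sin\tilde\omega_\ell n\sin\hat\omega_{2,\ell}n$. (79)

Equations 76 and 77 form a pair of normal equations in the form of Equation 23
which may be solved using the formulas of Equation 25 for $a_{1,\ell}$ and $a_{2,\ell}$; likewise,
Equations 78 and 79 are a second, independent pair of normal equations yielding $b_{1,\ell}$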
and $b_{2,\ell}$.

The inner product terms in Equations 76-79 may be calculated using the relations

$\sum_{n=-N}^{N}\cos\alpha n\cos\beta n = \tfrac12 F_N(\alpha - \beta) + \tfrac12 F_N(\alpha + \beta)$ (80)

$\sum_{n=-N}^{N}\sin\alpha n\sin\beta n = F_N(\alpha - \beta) - \sum_{n=-N}^{N}\cos\alpha n\cos\beta n$, (81)

where the function $F_N(\omega)$, defined as

$F_N(\omega) = \dfrac{\sin((2N+1)\omega/2)}{\sin(\omega/2)}$,

may be precalculated and used as required. Given the parameters determined from
the two sets of normal equations, the amplitude and phase parameters of $\hat{c}_\ell[n]$ are
derived using the relationships of Equation 27. The resulting amplitude and phase
parameters are then assigned to the sequence Z[i]
at index values $i_{1,\ell}$ and $i_{2,\ell}$.
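
The two-bin least-squares approximation can be sketched numerically. For clarity the sketch forms the inner products by direct summation rather than through the closed-form function $F_N$, and it solves the cosine and sine normal equations separately, which is valid because cosine and sine sequences are orthogonal over a symmetric interval; the function name is hypothetical.

```python
import numpy as np

def two_bin_fit(a, b, w, w1, w2, n_half):
    """Fit a*cos(w n) + b*sin(w n) on n = -n_half..n_half using bins w1 and w2.

    Returns (a1, a2, b1, b2), the least-squares cosine and sine coefficients.
    Note: w1 must be nonzero, otherwise the sine system is singular.
    """
    n = np.arange(-n_half, n_half + 1)
    c, c1, c2 = np.cos(w * n), np.cos(w1 * n), np.cos(w2 * n)
    s, s1, s2 = np.sin(w * n), np.sin(w1 * n), np.sin(w2 * n)
    gram_c = np.array([[c1 @ c1, c1 @ c2], [c1 @ c2, c2 @ c2]])
    gram_s = np.array([[s1 @ s1, s1 @ s2], [s1 @ s2, s2 @ s2]])
    a1, a2 = np.linalg.solve(gram_c, a * np.array([c @ c1, c @ c2]))
    b1, b2 = np.linalg.solve(gram_s, b * np.array([s @ s1, s @ s2]))
    return a1, a2, b1, b2
```

When the unconstrained frequency coincides with one of the two bins, the fit returns the original coefficients on that bin and zero on the other, as expected.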

In speech signals, synthetic contributions are highly correlated from one frame to
the next. In the presence of modifications, this correlation must be maintained if the
resulting modified speech is to be free from artifacts. To accomplish this, the time
shifts $\delta^k$ and $\delta^{k+1}$ in Equation 56 may be determined such that the underlying
excitation signal obeys specific constraints in both the unmodified and modified cases.
Examining Equation 59, if the component amplitudes are set to unity and the phases
set to zero, a "virtual excitation" sequence, or an impulse train with fundamental
frequency $\omega_0$ and shifted relative to the synthesis frame boundary by $\tau_p$ samples,
results. In "Phase Coherence in Speech Reconstruction for Enhancement and Coding
Applications," Proc. IEEE Int'l Conf. on Acoust., Speech and Signal Processing,
pp. 207-210, May 1989, McAulay and Quatieri derive an algorithm to preserve phase
coherence in the presence of modifications using virtual excitation analysis. The
following is a description of a refined version of this algorithm.

As illustrated in Figures 22A and 22B, in synthesis frame $k$ the unmodified virtual 
excitation of the $k$-th synthetic contribution has pulse locations relative to frame 
boundary A of $\tau_p^k + iT_0^k$, where $T_0^k = 2\pi/\omega_0^k$. These impulses are denoted by O's. 
Likewise, the pulse locations of the virtual excitation of the $(k+1)$-st synthetic 
contribution relative to frame boundary B are $\tau_p^{k+1} + iT_0^{k+1}$; these pulses are denoted 
by X's. For some integer $i_k$, a pulse location of the $k$-th contribution is adjacent to 
frame center C; likewise, for some $i_{k+1}$ a pulse location of the $(k+1)$-st contribution is 
adjacent to frame center C. The values of $i_k$ and $i_{k+1}$ can be found as 

$$i_k = \lfloor (N_s/2 - \tau_p^k)/T_0^k \rfloor$$

The time difference between the pulses adjacent to frame center C is shown as $\Delta$. 
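As a small illustration of the index calculation for the k-th contribution, the sketch below assumes a pulse train at locations tau_p + i*T0 measured from frame boundary A and a frame center at N_s/2. The function and argument names are hypothetical.

```python
import math

def pulse_index_near_center(tau_p, T0, Ns):
    """Index i of the last virtual-excitation pulse at or before the frame center Ns/2."""
    return math.floor((Ns / 2.0 - tau_p) / T0)

def pulse_location(tau_p, T0, i):
    """Location of pulse i relative to the frame boundary."""
    return tau_p + i * T0
```

For example, with an onset time of 13 samples, a pitch period of 40 samples and a frame length of 200 samples, the selected pulse sits at sample 93, and the next pulse (133) lies beyond the center at 100.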
In the presence of time- and frequency-scale modification, the relative virtual 
excitation pulse locations are changed to $n = (\tau_p^k + iT_0^k)/\beta_k - \delta^k$ and 
$n = (\tau_p^{k+1} + iT_0^{k+1})/\beta_{k+1} - \delta^{k+1}$ for modified synthetic contributions $k$ and $k+1$, respectively. In 
order to preserve frame-to-frame phase coherence in the presence of modifications, 
the time shift $\delta^{k+1}$ must be adjusted such that the time difference between pulses 
adjacent to modified frame center C is equal to $\Delta/\beta_{av}$, where $\beta_{av} = (\beta_k + \beta_{k+1})/2$. 
This condition is also shown in Figures 22A and 22B. The coherence requirement 
leads to an equation which can be solved for $\delta^{k+1}$, yielding the recursive relation 

where 



$$\hat{\imath}_k = \lfloor (\beta_k(\delta^k + \rho_k N_s/2) - \tau_p^k)/T_0^k \rfloor. \qquad (84)$$

The algorithms involved in DFT assignment block 1704 are illustrated in flowchart 
form in Figures 23 and 24. 

FFT block 1705 responds to the complex sequence $Z[i]$ produced by DFT assignment 
block 1704 to produce a complex sequence $z[n]$ which is the $M$-point DFT of $Z[i]$ 
according to Equation 31. Overlap-add block 1706 responds to the complex sequence 
output by FFT block 1705, time-scale modification factor $\rho_k$ received via path 1728, 
and time-varying gain sequence $a[n]$ received via path 1729 to produce a contiguous 
sequence $s[n]$, representative of synthetic speech, on a frame-by-frame basis. This is 
accomplished in the following manner: Taking the real part of the input sequence $z[n]$ 
yields the modified synthetic contribution sequence $s^k_{\rho_k,\beta_k}[n]$ as in the discussion of 
DFT assignment block 1704. Using the relation expressed in Equation 55, a synthesis 
frame of $s[n]$ is generated by taking two successive modified synthetic contributions, 
multiplying them by shifted and time-scaled versions of the synthesis window $w_s[n]$, 
adding the two windowed sequences together, and multiplying the resulting sequence 
by the time-scaled time-varying gain sequence $a[n]$. 
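The overlap-add step described above can be sketched as follows. This is only an illustration under stated assumptions: the synthesis window is taken to be triangular (the specification's actual window and the exact form of Equation 55 are not reproduced here), each contribution is supplied as a callable of the sample offset from its own frame boundary, and all names are hypothetical.

```python
import numpy as np

def triangular_window(n, Ns):
    """Assumed synthesis window: 1 at n = 0, falling linearly to 0 at |n| = Ns."""
    return np.clip(1.0 - np.abs(n) / Ns, 0.0, None)

def synthesize_frame(contrib_k, contrib_k1, Ns, rho=1.0, gain=None):
    """One time-scaled synthesis frame from two successive contributions.

    contrib_k(n), contrib_k1(n): callables returning each contribution at offsets n
    measured from its own frame boundary. gain(t): optional time-varying gain,
    sampled on the unmodified time axis t = n / rho.
    """
    L = int(round(rho * Ns))                     # modified frame length
    n = np.arange(L)
    w_k = triangular_window(n / rho, Ns)         # time-scaled window for contribution k
    w_k1 = triangular_window(n / rho - Ns, Ns)   # shifted, time-scaled window for k+1
    frame = w_k * contrib_k(n) + w_k1 * contrib_k1(n - L)
    if gain is not None:
        frame = frame * gain(n / rho)            # time-scaled gain sequence
    return frame
```

With this window the two weights sum to one across the frame, so two contributions that describe the same stationary sinusoid reproduce it exactly; a different window would change the cross-fade shape but not the structure of the computation.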

It should be understood that if speech analysis was performed without the time-varying 
gain sequence, then data path 1729 may be omitted from synthesizer 1700, 
and the overlap-add algorithm implemented with $a[n] = 1$. In addition, it should 
be readily apparent to those skilled in the art that if only time-scale modification is 
desired, data path 1727 may be omitted, and the modification algorithms described 
may be implemented with $\beta_k = 1$ for all $k$. Likewise, if only frequency-scale modification 
is desired, then data path 1728 may be omitted, and the modification algorithms 
described may be implemented with $\rho_k = 1$ for all $k$. 
Given $s[n]$, overlap-add block 1706 then produces an output data stream by quantizing 
the synthetic speech sequence using a quantization operator as in Equation 1. 
Digital-to-analog (D/A) converter 1707 responds to the data stream produced by 
overlap-add block 1706 to produce an analog signal $s_c(t)$ which is output from speech 
synthesizer 1700 via path 1730. 

While time- and frequency-scale modification of analyzed speech is sufficient for 
many applications, for certain applications other information must be accounted for 
when performing modifications. For instance, when speech is frequency-scale modified 
using speech synthesizer 1700, the component frequencies used in the sinusoidal 
model are changed, but the amplitude parameters are unaltered except as required 
to prevent aliasing; this results in compression or expansion of the "spectral envelope" 
of analyzed speech (of which $|H(e^{j\omega})|$ is an estimate). Since identifiable speech 
sounds are critically determined by this envelope, such "spectral distortion" may 
seriously degrade the intelligibility of synthetic speech produced by synthesizer 1700. 
Therefore, it is important to consider an approach to altering the fundamental 
frequency of speech while preserving its spectral envelope; this is known as pitch-scale 
modification. 

A second version of a speech synthesizer capable of performing time- and pitch-scale 
modification on previously analyzed speech signals is illustrated in Figure 25. 
Speech synthesizer 2500 operates identically to speech synthesizer 1700, except that 
an additional step, phasor interpolator 2501, is added to counteract the effects of 
spectral distortion encountered in speech synthesizer 1700. 

Phasor interpolator 2501 responds to the same set of parameters input to pitch 
onset time estimator 1703, the pitch onset time $\tau_p^k$ determined by pitch onset time 
estimator 2502 received via path 2520, and the pitch-scale modification factor $\beta_k$ 
received via path 2521 in order to determine a modified set of amplitudes $\{\hat{A}_\ell^k\}$, 
harmonic differential frequencies $\{\hat{\Delta}_\ell^k\}$, and phases $\{\hat{\phi}_\ell^k\}$ which produce a pitch-scale 
modified version of the original speech data frame. 

Consider now the operation of phasor interpolator 2501 in greater detail: According 
to the discussion of pitch onset time estimator 1703, a synthetic contribution to 
the glottal excitation sequence as given in Equation 57 is approximately a periodic 
pulse train whose fundamental frequency is $\omega_0^k$. In a manner similar to the 
pitch-excited LPC model, it might be expected that scaling the frequencies of $e^k[n]$ by $\beta_k$ 
and "reconvolving" with $H(e^{j\omega})$ at the scaled frequencies would result in synthetic 
speech with a fundamental frequency of $\beta_k\omega_0^k$ that maintains the same spectral 
shape of $H(e^{j\omega})$, and therefore the same intelligibility, as the original speech. 
Unfortunately, since the frequencies of $e^k[n]$ span the range from zero to $\pi$, this approach 
results in component frequencies spanning the range from zero to $\beta_k\pi$. For pitch-scale 
factors less than one, this "information loss" imparts a muffled quality to the 
modified speech. 

To address this problem, consider the periodic sequence obtained from $e^k[n]$ by 
setting $\omega_\ell^k = \ell\omega_0^k$: 

$$e^k[n] = \sum_{\ell=0}^{J[k]} b_\ell^k \cos(\ell\omega_0^k n + \theta_\ell^k). \qquad (85)$$

The goal of modifying the fundamental frequency of $e^k[n]$ without information loss is 
to specify a set of modified amplitude and phase parameters for the modified residual 
$\hat{e}^k[n]$, given by 

$$\hat{e}^k[n] = \sum_{\ell=0}^{\hat{J}[k]} \hat{b}_\ell^k \cos(\beta_k \ell\omega_0^k n + \hat{\theta}_\ell^k), \qquad (86)$$

(where $\hat{J}[k] = J[k]/\beta_k$) which span the frequency range from zero to $\pi$. Since as a 
function of frequency the pairs of amplitude and phase parameters are evenly spaced, 
a reasonable approach to this problem is to interpolate the complex "phasor form" 
of the unmodified amplitude and phase parameters across the spectrum and to derive 
modified parameters by resampling this interpolated function at the modified 
frequencies. 

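Again suppressing frame notation, this implies that given the interpolated function 
$\mathcal{E}(\omega)$, where 

$$\mathcal{E}(\omega) = \sum_{\ell=0}^{J} b_\ell e^{j\theta_\ell} I(\omega - \ell\omega_0), \qquad (87)$$

the modified amplitudes are given by $\hat{b}_\ell = |\mathcal{E}(\beta\ell\omega_0)|$, and the modified phases by 
$\hat{\theta}_\ell = \angle\mathcal{E}(\beta\ell\omega_0)$. 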

While any interpolation function $I(\omega)$ with the properties $I(\ell\omega_0) = 0$ for $\ell \neq 0$ 
and $I(0) = 1$ may be employed, a raised-cosine interpolator of the form 

$$I(\omega) = \begin{cases} \cos^2(\pi\omega/2\omega_0), & |\omega| < \omega_0 \\ 0, & \text{otherwise} \end{cases} \qquad (88)$$

makes the computation of $\mathcal{E}(\omega)$ much simpler, since all but two terms drop out of 
Equation 87 at any given frequency. Furthermore, since $I(\omega)$ is bandlimited, the 
effect of any single noise-corrupted component of $e^k[n]$ on the modified parameters is 
strictly limited to the immediate neighborhood of that component's frequency. This 
greatly reduces the problem of inadvertently amplifying background noise during 
modification by assuring that noise effects concentrated in one part of the spectrum 
do not "migrate" to another part of the spectrum where the magnitude of $H(e^{j\omega})$ 
may be greatly different. 
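The resampling idea described above can be illustrated with a short sketch. It assumes harmonically spaced components at multiples of omega0, the raised-cosine interpolator described above, and a modified component count of floor(J / beta); the function names are illustrative and NumPy is used for convenience.

```python
import numpy as np

def raised_cosine(omega, omega0):
    """Raised-cosine interpolator: 1 at omega = 0, 0 for |omega| >= omega0."""
    omega = np.asarray(omega, dtype=float)
    return np.where(np.abs(omega) < omega0,
                    np.cos(np.pi * omega / (2.0 * omega0)) ** 2,
                    0.0)

def phasor_interpolate(b, theta, omega0, beta):
    """Resample the interpolated phasor function at the frequencies beta * l * omega0.

    b, theta: unmodified amplitudes and phases for harmonics l = 0..J.
    Returns (modified amplitudes, modified phases) for l = 0..floor(J / beta).
    """
    b = np.asarray(b, dtype=float)
    theta = np.asarray(theta, dtype=float)
    J = len(b) - 1
    J_mod = int(np.floor(J / beta))
    phasors = b * np.exp(1j * theta)                   # complex "phasor form"
    targets = beta * np.arange(J_mod + 1) * omega0     # modified frequencies
    centers = np.arange(J + 1) * omega0                # unmodified frequencies
    weights = raised_cosine(targets[:, None] - centers[None, :], omega0)
    E = weights @ phasors                              # at most two nonzero terms per row
    return np.abs(E), np.angle(E)
```

With beta = 1 the weights reduce to the identity and the original parameters are returned unchanged; with zero phases and equal amplitudes, the two active weights at any frequency sum to one, so the flat excitation spectrum is preserved.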

The discussion of phasor interpolation to this point has ignored one important 
factor: the interpolated function $\mathcal{E}(\omega)$ is seriously affected by the phase terms $\{\theta_\ell\}$. 
To see this, consider the case when $\theta_\ell = 0$ for all $\ell$; in this case, $\mathcal{E}(\omega)$ is simply a 
straightforward interpolation of the amplitude parameters. However, if every other 
phase term is $\pi$ instead, $\mathcal{E}(\omega)$ interpolates adjacent amplitude parameters with 
opposite signs, resulting in a very different set of modified amplitude parameters. It is 
therefore reasonable to formulate phasor interpolation such that the effect of phase 
on the modified amplitudes is minimized. 
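The sensitivity to phase can be seen in a two-line experiment. The sketch below assumes a raised-cosine interpolator and unit amplitudes, and evaluates the interpolated phasor function halfway between two harmonics; names are illustrative.

```python
import cmath
import math

def raised_cosine(omega, omega0):
    """Raised-cosine interpolator, zero outside |omega| < omega0."""
    return math.cos(math.pi * omega / (2.0 * omega0)) ** 2 if abs(omega) < omega0 else 0.0

def interpolated_phasor(omega, b, theta, omega0):
    """Sum of b_l * exp(j*theta_l) * I(omega - l*omega0) over all harmonics l."""
    return sum(b_l * cmath.exp(1j * th_l) * raised_cosine(omega - l * omega0, omega0)
               for l, (b_l, th_l) in enumerate(zip(b, theta)))

omega0 = 0.2
b = [1.0] * 6
zero_phase = [0.0] * 6
alternating = [0.0 if l % 2 == 0 else math.pi for l in range(6)]

midpoint = 1.5 * omega0   # halfway between harmonics 1 and 2
amp_zero = abs(interpolated_phasor(midpoint, b, zero_phase, omega0))
amp_alt = abs(interpolated_phasor(midpoint, b, alternating, omega0))
```

Here `amp_zero` stays at the common amplitude while `amp_alt` collapses toward zero, even though the two inputs have identical amplitude spectra.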

As mentioned above, when the phase terms are all close to zero, phasor interpolation 
approximates amplitude interpolation. Furthermore, examining Equation 87 
reveals that when the phase terms are all close to $\pi$, phasor interpolation is approximately 
interpolation of amplitudes with a sign change, and that deviation from either 
of these conditions results in undesirable nonlinear amplitude interpolation. Recalling 
the description of pitch onset time estimator 1703, $\tau_p$ is estimated such that 
the "time-shifted" phase parameters $\{\psi_\ell(\tau_p)\}$ have exactly this property. Therefore, 
the phasor interpolation procedure outlined above may be performed using $\{\psi_\ell(\tau_p)\}$ 
instead of $\{\theta_\ell\}$, yielding the modified amplitude parameters $\{\hat{b}_\ell\}$ and interpolated 
phases $\{\hat{\psi}_\ell(\tau_p)\}$. The modified phase terms may then be derived by reversing the 
time shift imparted to $\{\hat{\psi}_\ell(\tau_p)\}$: 

$$\hat{\theta}_\ell = \hat{\psi}_\ell(\tau_p) - \beta\ell\omega_0\tau_p \qquad (89)$$
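The shift, interpolate, unshift sequence can be sketched as below. Two points are assumptions rather than statements of the specification: the time-shifted phase is taken to be psi_l = theta_l + l*omega0*tau_p, and the shift is reversed at the modified frequencies beta*l*omega0. A raised-cosine interpolator is used, and all names are illustrative.

```python
import numpy as np

def raised_cosine(omega, omega0):
    omega = np.asarray(omega, dtype=float)
    return np.where(np.abs(omega) < omega0,
                    np.cos(np.pi * omega / (2.0 * omega0)) ** 2, 0.0)

def interpolate(b, phase, omega0, beta):
    """Phasor interpolation resampled at beta * l * omega0, l = 0..floor(J / beta)."""
    b = np.asarray(b, dtype=float)
    phase = np.asarray(phase, dtype=float)
    J = len(b) - 1
    targets = beta * np.arange(int(np.floor(J / beta)) + 1) * omega0
    centers = np.arange(J + 1) * omega0
    E = raised_cosine(targets[:, None] - centers[None, :], omega0) @ (b * np.exp(1j * phase))
    return np.abs(E), np.angle(E)

def pitch_scale_excitation(b, theta, omega0, beta, tau_p):
    """Interpolate on time-shifted phases, then reverse the shift (assumed forms)."""
    ell = np.arange(len(b))
    psi = np.asarray(theta, dtype=float) + ell * omega0 * tau_p   # assumed psi_l(tau_p)
    b_hat, psi_hat = interpolate(b, psi, omega0, beta)
    ell_hat = np.arange(len(b_hat))
    theta_hat = psi_hat - beta * ell_hat * omega0 * tau_p         # reverse the time shift
    return b_hat, theta_hat
```

For a pure pulse train delayed by tau_p (unit amplitudes, phases -l*omega0*tau_p), direct interpolation of the raw phases attenuates the in-between components, whereas the time-shifted version returns unit amplitudes and a pulse train with the same onset time.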

At this point all that remains is to specify appropriate differential frequency terms 
in the equation for $\hat{e}^k[n]$. Although this task is somewhat arbitrary, it is reasonable 
to expect that the differential frequency terms may be interpolated uniformly in a 
manner similar to phasor interpolation, yielding 

$$\hat{\Delta}_\ell = \sum_{i=0}^{J} \Delta_i I(\beta\ell\omega_0 - i\omega_0). \qquad (90)$$

This interpolation has the effect that the modified differential frequencies follow the 
same trend in the frequency domain as the unmodified differentials, which is important 
both in preventing migration of noise effects and in modifying speech which 
possesses a noise-like structure in certain portions of the spectrum. 

Given the amplitude, phase and differential frequency parameters of a modified 
excitation contribution, the specification of a synthetic contribution to pitch-scale 
modified speech may be completed by reintroducing the effects of the spectral envelope 
to the amplitude and phase parameters at the modified frequencies $\hat{\omega}_\ell^k = \beta_k\ell\omega_0^k + \hat{\Delta}_\ell^k$: 

$$\hat{A}_\ell^k = \hat{b}_\ell^k \, \beta_k \, |H(e^{j\hat{\omega}_\ell^k})|$$

$$\hat{\phi}_\ell^k = \hat{\theta}_\ell^k + \angle H(e^{j\hat{\omega}_\ell^k}), \qquad (91)$$

where the multiplicative factor of $\beta_k$ on the amplitude parameters serves to normalize 
the amplitude of the modified speech. The algorithm used in phasor interpolator 2501 
is illustrated in flowchart form in Figures 26 and 27. All other algorithmic components 
of speech synthesizer 2500 and their structural relationships are identical to those of 
speech synthesizer 1700. As in speech synthesizer 1700, data path 2522 (which is 
used to transmit time-scale modification factor $\rho_k$) may be omitted if only pitch-scale 
modification is desired, and modification may be implemented with $\rho_k = 1$ for all $k$. 
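Reintroducing the spectral envelope at the modified frequencies can be sketched as follows. The envelope here is assumed, purely for illustration, to be an all-pole response H(z) = gain / (1 + sum_m a_m z^-m); the specification allows other envelope estimates, and the function names and coefficient convention are hypothetical.

```python
import numpy as np

def allpole_response(a, omega, gain=1.0):
    """H(e^{j*omega}) for the assumed all-pole model gain / (1 + sum_m a[m-1] * e^{-j*omega*m})."""
    a = np.asarray(a, dtype=float)
    omega = np.asarray(omega, dtype=float)
    m = np.arange(1, len(a) + 1)
    denom = 1.0 + np.exp(-1j * np.outer(omega, m)) @ a
    return gain / denom

def reintroduce_envelope(b_hat, theta_hat, omega_hat, beta, a, gain=1.0):
    """Scale modified excitation amplitudes by beta*|H| and add the envelope phase."""
    H = allpole_response(a, omega_hat, gain)
    A_hat = np.asarray(b_hat, dtype=float) * beta * np.abs(H)
    phi_hat = np.asarray(theta_hat, dtype=float) + np.angle(H)
    return A_hat, phi_hat
```

Because the envelope is sampled at the modified frequencies rather than scaled along with them, the formant structure stays in place while the harmonic spacing changes.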

Figure 28 illustrates a synthesizer embodiment of the present invention appropriate 
for the synthesis and modification of pitched musical tone signals. Music 
synthesizer 2800 of Figure 28 responds to stored encoded quasi-harmonic sinusoidal 
model parameters previously determined by music signal analysis in order to produce 
a synthetic facsimile of the original analog signal or alternately synthetic speech 
advantageously modified in time- and/or frequency-scale. Parameter decoder 2802 
responds to encoded parameters retrieved from storage element 2801 via path 2820 
in a manner similar to parameter encoder 1702 to produce the time-varying gain 
sequence $a[n]$ of Equation 8 (if calculated in analysis) and the fundamental frequency 
estimate $\omega_0^k$, quasi-harmonic model amplitudes $\{A_j^k\}$, differential frequencies $\{\Delta_j^k\}$ 
and phases $\{\phi_j^k\}$ used to generate a synthetic contribution according to Equation 45. 
DFT assignment block 2803 responds to the fundamental frequency received via 
path 2821, the sets of quasi-harmonic model amplitudes, differential frequencies and 
phases received via paths 2822, 2823 and 2824 respectively, frequency-scale modification 
factor $\beta_k$ and time-scale modification factor $\rho_k$ received via paths 2825 and 2826, 
respectively, to produce a sequence $Z[i]$ which may be used to construct a modified 
synthetic contribution using an FFT algorithm. The algorithm used in this block 
is identical to that of DFT assignment block 1704 of Figure 17, with the following 
exception: The purpose of the excitation pulse constraint algorithm used to calculate 
time shifts $\delta^k$ and $\delta^{k+1}$ in DFT assignment block 1704 is that the algorithm is 
relatively insensitive to errors in fundamental frequency estimation resulting in an 
estimate which is the actual fundamental multiplied or divided by an integer factor. 

However, for the case of pitched musical tones, such considerations are irrelevant 
since the fundamental frequency is approximately known a priori. Therefore, 
a simpler constraint may be invoked to determine appropriate time shifts. Specifically, 
denoting the phase terms of the sinusoids in Equation 56 by $\tilde{\Phi}_j^k[n]$ and $\tilde{\Phi}_j^{k+1}[n]$ 
respectively, where 

$$\tilde{\Phi}_j^k[n] = [\ldots], \qquad \tilde{\Phi}_j^{k+1}[n] = [\ldots], \qquad (92)$$

and denoting the unmodified phase terms by $\Phi_j^k[n]$ and $\Phi_j^{k+1}[n]$, a 
reasonable constraint on the phase behavior of corresponding components from each 
synthetic contribution is to require that the differential between the unmodified phase 
terms at the center of the unmodified synthesis frame match the differential between 
the modified phase terms at the modified frame center. Formally, this requirement is 
given by 

$$\tilde{\Phi}_j^{k+1}[-\rho_k N_s/2] - \tilde{\Phi}_j^k[\rho_k N_s/2] = \Phi_j^{k+1}[-N_s/2] - \Phi_j^k[N_s/2], \quad \text{for all } j. \qquad (93)$$

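Solving this equation for $\delta^{k+1}$ using the phase functions just defined yields the 
recursion 

$$\delta^{k+1} = \frac{\beta_k}{\beta_{k+1}}\left(\delta^k + (\rho_k - 1/\beta_k)\,N_s/2\right) + (\rho_k - 1/\beta_{k+1})\,N_s/2. \qquad (94)$$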

Note that there is no dependence on $j$ in this recursion, verifying that $\delta^{k+1}$ is a global 
time shift that needs to be calculated only once per frame. Furthermore, there is no 
dependence on the pitch onset time estimate $\tau_p^k$ as in DFT assignment block 1704; 
therefore, pitch onset time estimation as in speech synthesizer 1700 is not required for 
music synthesizer 2800. All other algorithmic components of music synthesizer 2800 
and their structural relationships are identical to those of speech synthesizer 1700. 
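The frame-center phase-matching constraint can be checked numerically under an explicit phase model. Everything in this sketch is an assumption made for illustration: the fundamental is taken as constant across the two contributions, the unmodified phase of component j is modeled as j*omega0*n + phi, the modified phase as j*beta*omega0*(n + delta) + phi, and the recursion is written in the form delta_next = (beta_k/beta_k1)*(delta_k + (rho_k - 1/beta_k)*Ns/2) + (rho_k - 1/beta_k1)*Ns/2.

```python
def next_time_shift(delta_k, beta_k, beta_k1, rho_k, Ns):
    """Assumed form of the global time-shift recursion for frame k -> k+1."""
    return ((beta_k / beta_k1) * (delta_k + (rho_k - 1.0 / beta_k) * Ns / 2.0)
            + (rho_k - 1.0 / beta_k1) * Ns / 2.0)

def unmodified_phase(j, omega0, phi, n):
    """Assumed unmodified phase of component j at offset n from its frame boundary."""
    return j * omega0 * n + phi

def modified_phase(j, omega0, phi, n, beta, delta):
    """Assumed modified phase: frequency scaled by beta, time shifted by delta."""
    return j * beta * omega0 * (n + delta) + phi

def center_mismatch(j, omega0, phi_k, phi_k1, beta_k, beta_k1, rho_k, Ns, delta_k, delta_k1):
    """Modified minus unmodified phase differential at the (modified) frame center."""
    mod = (modified_phase(j, omega0, phi_k1, -rho_k * Ns / 2.0, beta_k1, delta_k1)
           - modified_phase(j, omega0, phi_k, rho_k * Ns / 2.0, beta_k, delta_k))
    unmod = (unmodified_phase(j, omega0, phi_k1, -Ns / 2.0)
             - unmodified_phase(j, omega0, phi_k, Ns / 2.0))
    return mod - unmod
```

Under these assumptions the mismatch vanishes for every component index j, which is the sense in which the time shift is global, and with no modification (all factors equal to one) the shift simply carries over from frame to frame.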

As in speech synthesizer 1700, if only time-scale modification is desired, data path 
2825 may be omitted, and the modification algorithms described may be implemented 
with $\beta_k = 1$ for all $k$. Likewise, if only frequency-scale modification is desired, then 
data path 2826 may be omitted, and the modification algorithms described may be 
implemented with $\rho_k = 1$ for all $k$. 

A second version of a music synthesizer capable of performing time- and pitch-scale 
modification on previously analyzed musical tone signals is illustrated in Figure 
29. Music synthesizer 2900 operates identically to speech synthesizer 2500, with the 
exception that the time shift parameters used in modification synthesis are calculated 
according to Equation 94. As in speech synthesizer 2500, data path 2921 (which is 
used to transmit time-scale modification factor $\rho_k$) may be omitted if only pitch-scale 
modification is desired, and modification may be implemented with $\rho_k = 1$ for all $k$. 

The architecture of a possible implementation of an audio analysis/synthesis sys- 
tem using a general-purpose digital signal processing microprocessor is illustrated in 
Figure 30. It should be noted that this implementation is only one of many alternative 
embodiments that will be readily apparent to those skilled in the art. For example, 
certain subgroups of the algorithmic components of the various systems may be imple- 
mented in parallel using application-specific IC's (ASIC's), field-programmable gate 
arrays (FPGA's), standard IC's, or discrete components. 
WHAT IS CLAIMED: 

1. A method of synthesizing artifact-free modified speech signals from a 
parameter set and a sequence of frequency-scale modification factors, 
the parameter set comprising a sequence of coefficient sets representative of a 

sequence of estimates of the frequency response of a human vocal tract, a corresponding 
sequence of estimates of a fundamental frequency, and a corresponding sequence of quasi- 
harmonic sinusoidal model parameter sets; 

each one of the estimates of a fundamental frequency and the corresponding 
quasi-harmonic sinusoidal model parameter set comprising a representation of one of a sequence 
of overlapping speech data frames; 

the method comprising the steps of: 

(a) estimating, with a pitch onset time estimator responsive to the sequence of 
coefficient sets, the sequence of estimates of a fundamental frequency, and the sequence of 

quasi-harmonic sinusoidal model parameter sets, a sequence of excitation times relative to 
the centers of each one of the corresponding overlapping speech data frames in the 
sequence of speech data frames at which an excitation pulse occurs; 

(b) generating a frequency-domain sequence of data frames from a discrete 
Fourier transform assignment means responsive to the sequence of excitation times, the 

corresponding sequence of quasi-harmonic sinusoidal model parameter sets, the sequence 
of frequency-scale modification factors, and the sequence of estimates of a fundamental 
frequency; 

(c) transforming the frequency-domain sequence of data frames with an inverse 
discrete Fourier transform means to produce a time-domain sequence of data frames; 

(d) generating a contiguous sequence of speech data representative of the 

modified speech signal from an overlap-add means responsive to the time-domain 

sequence of data frames; and 

(e) converting the contiguous sequence of speech data into an analog signal 

using a digital-to-analog converter means to produce the modified speech signal. 
2. The method of claim 1 wherein the parameter set further comprises an 

envelope stream representative of time-varying average magnitude, the sequence of 
overlapping speech data frames is further represented by the envelope stream, and 
the overlap-add means is additionally responsive to the envelope stream. 

3. A method of synthesizing artifact-free modified speech signals from a 
parameter set and a sequence of time-scale modification factors, 

the parameter set comprising a sequence of coefficient sets representative of a 
sequence of estimates of the frequency response of a human vocal tract, a corresponding 
sequence of estimates of a fundamental frequency, and a corresponding sequence of quasi- 
harmonic sinusoidal model parameter sets; 
each one of the estimates of a fundamental frequency and the corresponding 
quasi-harmonic sinusoidal model parameter set comprising a representation of one of a sequence 
of overlapping speech data frames; 

the method comprising the steps of: 

(a) estimating, with a pitch onset time estimator responsive to the sequence of 
coefficient sets, the sequence of estimates of a fundamental frequency, and the sequence of 

quasi-harmonic sinusoidal model parameter sets, a sequence of excitation times relative to 
the centers of each one of the corresponding overlapping speech data frames in the 
sequence of speech data frames at which an excitation pulse occurs; 

(b) generating a frequency-domain sequence of data frames from a discrete 
Fourier transform assignment means responsive to the sequence of excitation times, the 

corresponding sequence of quasi-harmonic sinusoidal model parameter sets, the sequence 
of estimates of a fundamental frequency, and the sequence of time-scale modification 
factors; 

(c) transforming the frequency-domain sequence of data frames with an inverse 
discrete Fourier transform means to produce a time-domain sequence of data frames; 

(d) generating a contiguous sequence of speech data representative of the 
modified speech signal from an overlap-add means responsive to the time-domain 
sequence of data frames and the sequence of time-scale modification factors; and 

(e) converting the contiguous sequence of speech data into an analog signal 
using a digital-to-analog converter means to produce the modified speech signal. 

4. The method of claim 3 wherein the parameter set further comprises an 
envelope stream representative of time-varying average magnitude, the sequence of 
overlapping speech data frames is further represented by the envelope stream, and 
the overlap-add means is additionally responsive to the envelope stream. 

5. A method of synthesizing artifact-free modified speech signals from a 

parameter set and a sequence of pitch-scale modification factors, 

the parameter set comprising a sequence of coefficient sets representative of a 
sequence of estimates of the frequency response of a human vocal tract, a corresponding 
sequence of estimates of a fundamental frequency, and a corresponding sequence of 

unmodified quasi-harmonic sinusoidal model parameter sets; 

each one of the estimates of a fundamental frequency and the corresponding quasi- 
harmonic sinusoidal model parameter set comprising a representation of one of a sequence 
of overlapping speech data frames; 

the method comprising the steps of: 

(a) estimating, with a pitch onset time estimator responsive to the sequence of 

coefficient sets, the sequence of estimates of a fundamental frequency, and the sequence of 
unmodified quasi-harmonic sinusoidal model parameter sets, a sequence of excitation 
times relative to the centers of each one of the corresponding overlapping speech data 
frames in the sequence of speech data frames at which an excitation pulse occurs; 

(b) generating a sequence of modified quasi-harmonic sinusoidal model 

parameter sets with a phasor interpolator responsive to the sequence of excitation times, 
the sequence of pitch-scale modification factors, the sequence of estimates of the 
fundamental frequency, the sequence of coefficient sets, and the sequence of unmodified 
quasi-harmonic sinusoidal model parameter sets, each of the modified quasi-harmonic 

sinusoidal model parameter sets comprising a set of modified amplitudes, a corresponding 
set of modified frequencies, and a corresponding set of modified phases; 

(c) generating a frequency-domain sequence of data frames from a discrete 
Fourier transform assignment means responsive to the sequence of excitation times, the 
corresponding sequence of modified quasi-harmonic sinusoidal model parameter sets, the 

sequence of pitch-scale modification factors, and the sequence of estimates of a 
fundamental frequency; 

(d) transforming the frequency-domain sequence of data frames with an inverse 
discrete Fourier transform means to produce a time-domain sequence of data frames; 

(e) generating a contiguous sequence of speech data representative of the 
modified speech signal from an overlap-add means responsive to the time-domain 
sequence of data frames; and 
(f) converting the contiguous sequence of speech data into an analog signal 

using a digital-to-analog converter means to produce the modified speech signal. 

6. The method of claim 5 wherein the parameter set further comprises an 
envelope stream representative of time-varying average magnitude, the sequence of 

overlapping speech data frames is further represented by the envelope stream, and 
the overlap-add means is additionally responsive to the envelope stream. 

7. A method of synthesizing artifact-free modified musical tone signals from a 
parameter set and a sequence of frequency-scale modification factors; 

the parameter set comprising a sequence of fundamental frequency estimates and a 

sequence of quasi-harmonic sinusoidal model parameter sets; 
the method comprising the steps of: 

(a) generating a frequency-domain sequence of data frames from a discrete 
Fourier transform assignment means responsive to the sequence of fundamental frequency 

estimates, the corresponding sequence of quasi-harmonic sinusoidal model parameter sets, 
and the sequence of frequency-scale modification factors; 

(b) transforming the frequency-domain sequence of data frames with an inverse 
discrete Fourier transform means to produce a time-domain sequence of data frames; 

(c) generating a contiguous sequence of music data representative of the 

modified musical tone signals from an overlap-add means responsive to the time-domain 
sequence of data frames; and 

(d) converting the contiguous sequence of music data into an analog signal 
using a digital-to-analog converter means to produce the modified musical tone signal. 

8. The method of claim 7 wherein the parameter set further comprises an 

envelope stream representative of time-varying average magnitude, and the overlap-add 
means is additionally responsive to the envelope stream. 
9. A method of synthesizing artifact-free modified musical tone signals from a 
parameter set and a sequence of time-scale modification factors; 

the parameter set comprising a sequence of fundamental frequency estimates and a 
sequence of quasi-harmonic sinusoidal model parameter sets; 
the method comprising the steps of: 

(a) generating a frequency-domain sequence of data frames from a discrete 
Fourier transform assignment means responsive to the sequence of fundamental frequency 
estimates, the corresponding sequence of quasi-harmonic sinusoidal model parameter sets, 
and the sequence of time-scale modification factors; 
(b) transforming the frequency-domain sequence of data frames with an inverse 

discrete Fourier transform means to produce a time-domain sequence of data frames; 

(c) generating a contiguous sequence of music data representative of the 
modified musical tone signals from an overlap-add means responsive to the time-domain 
sequence of data frames and the sequence of time-scale modification factors; and 
(d) converting the contiguous sequence of music data into an analog signal 

using a digital-to-analog converter means to produce the modified musical tone signal. 

10. The method of claim 9 wherein the parameter set further comprises an 
envelope stream representative of time-varying average magnitude, and the overlap-add 

means is additionally responsive to the envelope stream. 

11. A method of synthesizing artifact-free modified musical tone signals from a 
parameter set and a sequence of pitch-scale modification factors; 

the parameter set comprising a sequence of coefficient sets representative of a 
sequence of estimates of a spectral envelope, a corresponding sequence of estimates of a 
fundamental frequency, and a corresponding sequence of unmodified quasi-harmonic 
sinusoidal model parameter sets; 

each one of the estimates of a fundamental frequency and the corresponding quasi- 
harmonic sinusoidal model parameter set comprising a representation of one of a sequence 
of overlapping musical tone data frames; 

the method comprising the steps of: 
(a) estimating, with a pitch onset time estimator responsive to the sequence of 
coefficient sets, the sequence of estimates of a fundamental frequency, and the sequence of 
unmodified quasi-harmonic sinusoidal model parameter sets, a sequence of excitation 
times relative to the centers of each one of the corresponding musical tone data frames in 

the sequence of musical tone data frames at which an excitation pulse occurs; 

(b) generating a sequence of modified quasi-harmonic sinusoidal model 
parameter sets with a phasor interpolator means responsive to the sequence of excitation 
times, the sequence of pitch-scale modification factors, the sequence of estimates of the 
fundamental frequency, the sequence of coefficient sets and the sequence of unmodified 

quasi-harmonic sinusoidal model parameter sets; 

(c) generating a frequency-domain sequence of data frames from a discrete 
Fourier transform assignment means responsive to the sequence of modified quasi- 
harmonic sinusoidal model parameter sets, the sequence of pitch-scale modification 
factors, and the sequence of estimates of a fundamental frequency; 

(d) transforming the frequency-domain sequence of data frames with an inverse 

discrete Fourier transform means to produce a time-domain sequence of data frames; 
(e) generating a contiguous sequence of musical data representative of the 

modified musical tone signal from an overlap-adder responsive to the time-domain 

sequence of data frames; and 
(f) converting the contiguous sequence of musical data into an analog signal 

using a digital-to-analog converter means to produce the modified musical tone signal. 

12. The method of claim 11 wherein the parameter set further comprises an 
envelope stream representative of time-varying average magnitude, and the overlap-adder 

is additionally responsive to the envelope stream. 

13. An apparatus for generating a signal representative of a synthetic speech 
waveform from a set of parameters representative of overlapping speech data frames 
stored in a memory means, and a sequence of frequency scale modification factors; 

the set of parameters comprising a sequence of quasi-harmonic sinusoidal model 

parameter sets, a sequence of coefficient sets representative of a frequency response of a 
human vocal tract, and a sequence of fundamental frequency estimates, 
the apparatus comprising: 

(a) a pitch onset time estimator means electrically coupled to the memory 
means and responsive to the sequence of coefficient sets, the sequence of fundamental 
frequency estimates, and the sequence of quasi-harmonic sinusoidal model parameter sets 

for generating a first signal representative of a sequence of excitation times relative to the 
center of each of the corresponding speech data frames at which an excitation pulse 
occurs; 

(b) a discrete Fourier transform assignment means electrically coupled to the 
memory means and responsive to the sequence of fundamental frequency estimates, the 

sequence of quasi-harmonic sinusoidal model parameter sets, the first signal, and the 
sequence of frequency-scale modification factors for producing a second signal from 
which a modified synthetic contribution may be generated using a discrete Fourier 
transform algorithm; 

(c) a discrete Fourier transform means responsive to the second signal for 
generating a transformed signal; and 

(d) an overlap-add means responsive to the transformed signal for generating 
the signal representative of the synthetic speech waveform. 

14. The apparatus of claim 13, wherein the speech information further comprises 
an envelope stream representative of time-varying average magnitude, and the overlap-add
means is electrically coupled to the memory means and is additionally responsive to the
envelope stream. 

15. An apparatus for generating a signal representative of a synthetic speech 
waveform from a set of parameters representative of overlapping speech data frames
stored in a memory means and a sequence of time-scale modification factors,

the set of parameters comprising a sequence of quasi-harmonic sinusoidal model 
parameter sets, a sequence of coefficient sets representative of a frequency response of a 
human vocal tract, and a sequence of fundamental frequency estimates, the apparatus 
comprising:

(a) a pitch onset time estimator means electrically coupled to the memory 
means and responsive to the sequence of coefficient sets, the sequence of fundamental 
frequency estimates, and the sequence of quasi-harmonic sinusoidal model parameter sets
for generating a first signal representative of a sequence of excitation times relative to the 
center of each of the corresponding speech data frames at which an excitation pulse 
occurs; 

(b) a discrete Fourier transform assignment means electrically coupled to the
memory means and responsive to the sequence of fundamental frequency estimates, the
sequence of quasi-harmonic sinusoidal model parameter sets, the first signal, and the
sequence of time-scale modification factors for producing a second signal from which a
modified synthetic contribution may be generated using a discrete Fourier transform
algorithm;

(c) a discrete Fourier transform means responsive to the second signal for 
generating a transformed signal; and 

(d) an overlap-add means responsive to the transformed signal and the 
sequence of time-scale modification factors for generating the signal representative of the
synthetic speech waveform.

16. The apparatus of claim 15, wherein the speech information further 
comprises an envelope stream representative of time-varying average magnitude, and the 
overlap-add means is electrically coupled to the memory means and is additionally
responsive to the envelope stream.

17. An apparatus for generating a synthetic speech waveform from a set of 
parameters representative of overlapping speech data frames stored in a memory means 
and a sequence of pitch-scale modification factors; 

the speech information comprising a sequence of quasi-harmonic sinusoidal model
parameter sets, a sequence of coefficient sets representative of a frequency response of a
human vocal tract, and a sequence of fundamental frequency estimates, 
the apparatus comprising: 

(a) a pitch onset time estimator means electrically coupled to the memory means 
and responsive to the sequence of coefficient sets, the sequence of fundamental frequency
estimates, and the sequence of quasi-harmonic sinusoidal model parameter sets for
generating a first signal representative of a sequence of time estimates relative to the center
of each of the frames at which an excitation pulse occurs;

(b) a phasor interpolator means electrically coupled to the memory means and the 
pitch onset time estimator means and responsive to the sequence of coefficient sets, the 
sequence of fundamental frequency estimates, the sequence of quasi-harmonic sinusoidal
model parameter sets, the first signal, and the sequence of pitch-scale modification factors
for generating a sequence of modified quasi-harmonic sinusoidal model parameter sets; 

(c) a discrete Fourier transform assignment means electrically coupled to the 
phasor interpolator means and the pitch onset time estimator means and responsive to the 
sequence of fundamental frequency estimates, the sequence of modified quasi-harmonic
sinusoidal model parameter sets, the first signal and the sequence of pitch-scale
modification factors for producing a second signal from which a modified synthetic 
contribution may be generated using a discrete Fourier transform algorithm; 

(d) a discrete Fourier transform means responsive to the second signal for 
generating a transformed signal; and 

(e) an overlap-add means responsive to the transformed signal for generating the
signal representative of the synthetic speech waveform.

18. The apparatus of claim 17, wherein the speech information further comprises 
an envelope stream representative of time-varying average magnitude, and the overlap-add
means is electrically coupled to the memory means and is additionally responsive to the
envelope stream. 

19. An apparatus for generating a signal representative of a synthetic musical
waveform from a set of parameters representative of overlapping musical tone data frames
stored in a memory means and a sequence of frequency-scale modification factors;

the parameter set comprising a sequence of quasi-harmonic sinusoidal model 
parameter sets and a sequence of fundamental frequency estimates, 
the apparatus comprising: 

(a) a discrete Fourier transform assignment means electrically coupled to the 
memory means and responsive to the sequence of fundamental frequency estimates, the
sequence of quasi-harmonic sinusoidal model parameter sets, and the sequence of
frequency-scale modification factors for producing a first signal from which a modified
synthetic contribution may be generated using a discrete Fourier transform algorithm;

(b) a discrete Fourier transform means responsive to the first signal for generating a
transformed signal; and

(c) an overlap-add means responsive to the transformed signal for generating the 
signal representative of the synthetic musical waveform.

20. The apparatus of claim 19 wherein the musical information further comprises 
an envelope stream representative of time-varying average magnitude, and the overlap-add 
means is electrically coupled to the memory means and is additionally responsive to the
envelope stream.

21. An apparatus for generating a signal representative of a synthetic musical 
waveform from a set of parameters representative of overlapping musical tone data frames 
stored in a memory means and a sequence of time-scale modification factors; 

the parameter set comprising a sequence of quasi-harmonic sinusoidal model
parameter sets and a sequence of fundamental frequency estimates,
the apparatus comprising: 

(a) a discrete Fourier transform assignment means electrically coupled to the 
memory means and responsive to the sequence of fundamental frequency estimates, the
sequence of quasi-harmonic sinusoidal model parameter sets, and the sequence of
time-scale modification factors for producing a first signal from which a modified synthetic
contribution may be generated using a discrete Fourier transform algorithm; 

(b) a discrete Fourier transform means responsive to the first signal for generating a
transformed signal; and

(c) an overlap-add means responsive to the transformed signal and the sequence of
time-scale modification factors for generating the signal representative of the synthetic
musical waveform. 

22. The apparatus of claim 21 wherein the musical information further comprises 
an envelope stream representative of time-varying average magnitude, and the overlap-add
means is electrically coupled to the memory means and is additionally responsive to the 
envelope stream. 

23. An apparatus for generating a signal representative of a synthetic musical tone
waveform from a set of parameters representative of overlapping frames of musical data 
stored in a memory means and a sequence of pitch-scale modification factors; 

the musical information comprising a sequence of quasi-harmonic sinusoidal 
model parameter sets, a sequence of coefficient sets representative of estimates of a
spectral envelope, and a sequence of fundamental frequency estimates, 

the apparatus comprising: 

(a) a pitch onset time estimator means electrically coupled to the memory means 
and responsive to the sequence of coefficient sets, the sequence of fundamental frequency
estimates, and the sequence of quasi-harmonic sinusoidal model parameter sets for
generating a first signal representative of a sequence of time estimates relative to the center
of each of the frames at which an excitation pulse occurs; 

(b) a phasor interpolator means electrically coupled to the memory means and the 
pitch onset time estimator means and responsive to the sequence of coefficient sets, the
sequence of fundamental frequency estimates, the sequence of quasi-harmonic sinusoidal
model parameter sets, the first signal, and the sequence of pitch-scale modification factors 
for generating a sequence of modified quasi-harmonic sinusoidal model parameter sets; 

(c) a discrete Fourier transform assignment means electrically coupled to the 
phasor interpolator means and responsive to the sequence of fundamental frequency
estimates, the sequence of modified quasi-harmonic sinusoidal model parameter sets, and
the sequence of pitch-scale modification factors for producing a second signal from which 
a modified synthetic contribution may be generated using a discrete Fourier transform 
algorithm; 

(d) a discrete Fourier transform means responsive to the second signal for 
generating a transformed signal; and

(e) an overlap-add means responsive to the transformed signal for generating the
signal representative of the synthetic musical tone waveform.

24. The apparatus of claim 23, wherein the musical information further comprises 
an envelope stream representative of time-varying average magnitude, and the overlap-add
means is electrically coupled to the memory means and is additionally responsive to the 
envelope stream. 